STATUS OF PROJECT

Objective 1 (register-level design of MCAP): The register-level
design of the MCAP has been completed for all component types
except the instruction component, which has been partially
designed. The component designs are currently being documented
by Glenn Gibson and an undergraduate assistant. This design
effort has resulted in a masters-level thesis.

Objective 2 (architecture/algorithm case studies): This study
has progressed slowly because of the incompleteness of the
simulator. However, the simulator has just been completed and
considerable progress on this objective is expected in the next
six months. Preliminary work has been done by assuming all
data needed by an algorithm are already in the MCAP. Also,
Sergio Cabrera and his post-doctoral assistant have investigated
which algorithms should be studied in the area of signal and
image processing, and Yi-Chieh Chang and a masters-level student
have simulated some matrix operations. An undergraduate
assistant and a doctoral student have just started work in this
area, and a masters-level student is to be added in the near
future. Chang, Gibson and a masters-level student have
produced a conference paper on matching matrix multiplication
to an MCAP architecture.

Objective 3 (two memory controller designs): Yu-Cheng Liu and
a  masters-level  student  began  their  work  on  comparing  two 
memory  controller  designs  in  August  when  our  new  workstations 
arrived  and  the  Mentor  Graphics  design  software  was  installed 
on  them.  This  work  will  continue  throughout  the  coming  year. 
A  related  study  by  Gibson,  Liu  and  Chang  of  memory  hierarchies 
and  computational  intensity  has  resulted  in  a  conference  paper. 

Objective 4 (technology evaluations): Considerable work has
been done in this area. This work has mainly been concerned
with implementing an MCAP on a multichip module and has been
carried out by Vijay Singh and two masters-level students with
some assistance from Gibson. Some investigation of a wafer-scale
implementation has been done by Chang. This work has produced
a master's thesis and two conference papers. Another paper has
been submitted to a journal.

Objective 5 (simulator development): Except for updating the
architecture editor, the simulator software package has been
completed. However, because it is difficult to program an MCAP
using the current assembler, a preprocessor is being programmed
by Chang and an undergraduate assistant. Gibson and an
undergraduate assistant will continue testing and documenting this
software. This work has produced two theses and a third is
currently being written. A conference paper is being written
and will be submitted before October 15. An expanded version
of this paper will be submitted to a journal.


2.  MASTERS-LEVEL  THESES  PRODUCED 

Ernesto Castro-Gomez, "Assembler Design and Algorithm
Implementation on a Modularly Configured Attached Processor,"
Thesis, University of Texas at El Paso, 1994.

Alejandro  Brito,  "A  Graphics  Editor  for  the  MCAP  Simulator," 
Thesis,  University  of  Texas  at  El  Paso,  1994. 

Sanjay Singh, "A Comparative Evaluation of Implementing a Novel
Modularly Configured Attached Processor Architecture," Thesis,
University of Texas at El Paso, 1994.

Michael  Flahie,  "Fast  N-bit  Multipliers  Using  Cascaded  Half 
Adders,"  Thesis,  University  of  Texas  at  El  Paso,  1994. 

Stephen Synesyzn, "Simulation of a Memory Controller for a
Modularly Configured Attached Processor," Thesis, University of
Texas at El Paso, currently being written.


3.  PAPERS  ACCEPTED  FOR  PUBLICATION  (see  attachments) 

G. A. Gibson, Vijay Singh, Sanjay Singh, Y. C. Liu, Y. C. Chang
[...]ed Attached Processors," IEEE International Computer Symposium,
Taiwan, Dec., 1994.

Sanjay Singh, B.W. Gremel, Vijay Singh and G. A. Gibson,
[...]ured Attached Processor," submitted to Int'l J. of Electronics.

Abstract

A new architecture for high-performance parallel attached
processors is described in this paper. Based on this
architecture, an attached processor can be implemented as
multiple memory-to-memory pipelines, each being constructed
with a class of fundamental components. The unique
features are that the attached processor can be configured to
match a set of algorithms and its memory controllers can
be programmed to fit the access patterns required by the
algorithms. As a result, high utilization of the processing
logic for given sets of algorithms can be obtained. An
example based on matrix multiplication is used for
illustration. Finally, design issues related to the implementation
of the attached processor based on an MCM technology are
discussed. *

1  Introduction 

An attached, or back-end, processor is a processing system
that is connected to a host computer for the purpose of
very quickly executing most of the overall system's computational
tasks. Typical early attached processors were the
AP-120B and FPS-164 made by Floating Point Systems,
Inc., the IBM 3838, and the MATP made by Datawest,
Inc. [1], [2], [3]. These attached processors all have their
own data memories and transfer data between these memories
and the main memories of their hosts using DMA
data channels. They also include their own code memories
where subprograms may be permanently stored or downloaded
from their hosts. These subprograms are initiated
by commands from the host and supervise the data flows
from the attached processor's data memories, through the
attached processor's processing elements, and back into
the data memories.

Although the early attached processors included limited
multiprocessing, the more recently implemented processing
arrays are also controlled by a host (e.g., the PAX computer [4])
and are designed to perform most of the overall
system's computational tasks. Therefore, these arrays and
even the array processing portions of today's supercomputers,
such as the Cray series [1], [3], could be interpreted
as attached processors.

* The work reported in this paper was supported in part by
the Office of Naval Research under Grant No. N00014-93-1-1343.
Any opinions, findings, and conclusions or recommendations
expressed in this paper are those of the authors and do
not necessarily reflect the view of the funding agency.

The specific purpose of an attached processor is to execute
members of a set of algorithms very quickly. The broader
the set of algorithms, the more generally applicable the
attached processor. The underlying goal of the designer is
to efficiently utilize the hardware for as broad a set of
algorithms as possible. However, for most current designs,
the average sustainable execution rates have been found
to be only 5% to 20% of their peak rates, which are
determined by summing the maximum computational rates
of the processing elements. For example, the sustainable
rate for a Cray X-MP with four processors may be as low
as 5% of its peak rate for some algorithms [5]. Also, extensive
evaluations of recent high-performance computations using
LAPACK are given in [6] and using the NAS parallel benchmarks
are given in [7]. Although some of the lost efficiency is
necessitated by the algorithms, much of it is due to memory
accessing and contention for shared resources in general,
including internal buses.

Described in this paper is a class of high-performance
attached processors called Modularly Configurable Attached
Processors (MCAPs), which can attain high speed and high
utilization through: (1) closely matching their architectures
to the set of algorithms they are to execute; (2)
overlapping processing and memory accessing by using
memory prefetching; (3) minimizing the movement of
data; and (4) using a high-speed technology with MCM or
wafer-scale implementations.

An MCAP is constructed from the component types specified
in Sec. 2. These component types are such that
each member of the class may include parallel processing
and memory-to-memory pipelines and may be constructed in
a building-block fashion. They encompass routing components
(including buses) as well as memory, control, and
processing components. By overlapping processing with
memory accessing and matching an architecture with a set
of algorithms, it is predicted that the average sustainable
rate for a specific set of algorithms can attain at least 60%
of the peak rate. By defining components that are simple
enough to be fabricated onto single low-density ICs, a
high-speed technology may be used.

Much of an MCAP's efficiency is gained by distributing the
instructions for the next algorithm (or algorithm phase)
to the various components while the current algorithm (or
phase) is executing. Once the algorithm begins, these
instructions dictate the modes, routing patterns, prefetching
patterns, and so on of the components receiving them.
After an algorithm starts, each component operates more
or less on its own except for responding to its handshaking
signals. Efficiency is further enhanced by prefetching
operands from the memory subsystems. Prefetching using
programmed patterns avoids the misses that result from
using ordinary caches.

Section 2 describes the architecture of the MCAP and the
fundamental components required to construct an MCAP.
Section 3 illustrates how to match an algorithm with a
given MCAP architecture in order to attain a high sustainable
rate. A major issue related to the implementation
of MCAPs is the choice of semiconductor technology and
packaging, which affect speed, gate density, power dissipation,
and cost. The emphasis of the implementation considerations
given in this paper is on CMOS Multi-Chip
Module (MCM) technology due to its ability to achieve
fast inter-chip communication. Section 4 discusses various
design issues involved in using the MCM approach to implement
the MCAPs. Such considerations include transistor
count, loading, estimated speed, and power dissipation.

2  MCAP  Architecture 

An MCAP is an attached processor that is constructed entirely
from a standard set of connections and components.
This standard set consists of two types of asynchronous
connections and twelve types of components. The definitions
of the connection and component types provide
a standard set of rules that allow the components to be
easily configured in different ways to construct attached
processors that can efficiently perform different sets of
algorithms.

An  MCAP  has  exactly  one  instruction  component  and  it  is 
connected  to  a  memory  component  for  storing  instructions. 
Most  of  this  memory  component  is  a  ROM  that  contains 
the  subprograms  needed  to  execute  the  algorithms,  but 
some  of  it  is  a  RAM  that  can  receive  instructions  (those 
that  initiate  the  subprograms)  from  the  host. 

An MCAP operates by drawing an instruction stream from
the memory component into the instruction component.
The instruction component uses internal instructions in
the stream to form external instructions that are then
distributed to the other non-memory components through the
MCAP's (one and only) bus component. The instruction
stream is illustrated in Fig. 1. Note that all components
in the instruction stream include input instruction queues.
When the non-memory components have received all of the
instructions needed to perform an algorithm, they automatically
prefetch the data from the memory components,
route the data to and from the processor components and
store the results back into the memory components. Some
controller components, which are the components that supervise
all memory accessing, are used to automatically
transfer data between the host's main memory and the
MCAP's memory components. The instruction and data
streams are separate, thereby allowing the instructions
needed for the next algorithm to be distributed while the
current algorithm is executing.


Fig. 1. The instruction stream (block labels surviving from the
figure: "Instructions from HOST", "Other non-memory components")

The two types of connections are referred to as instruction
and data connections. These connections are asynchronous
and, therefore, must include handshaking lines as
well as data and, perhaps, address lines. Because memory
components are connected to controller components only,
they are an integral part of controller/memory subsystems.
Therefore, the exact controller/memory connection
specifications are left to the subsystem designer and may be
synchronous.

Instruction connections are for passing instructions from
the instruction component to the bus component and from
the bus component to one of the other non-memory components.
An instruction connection consists of unidirectional
instruction and address buses and a Req/Ack handshaking
pair. The component that is to receive the instruction is
indicated by a component number on the address bus.
A transfer is initiated when the sending component puts
an address on the address bus, an instruction on the
instruction bus and begins the handshaking. Except for the
connections to memory components, all connections used
to transfer data are data connections. They are used to
pass data to and from the processors and consist of only a
unidirectional data bus and a Req/Ack pair. A data transfer
consists of placing data on the data bus and initiating
the handshaking. Except for a write to a memory component,
all transfers include the latching of an instruction or
datum into a queue at the receiving end.
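
The data-connection behavior described above can be sketched in Python. This is only an illustrative model: the class name `DataConnection`, the queue depth, and the use of a boolean return value to stand in for the Ack signal are assumptions, not part of the MCAP definition.

```python
from collections import deque


class DataConnection:
    """Toy model of a unidirectional data connection with a Req/Ack pair.

    Assumption: the receiver latches each datum into a bounded input queue
    and withholds Ack while that queue is full.
    """

    def __init__(self, depth=4):
        self.depth = depth        # hypothetical queue depth
        self.queue = deque()      # input queue at the receiving end

    def request(self, datum):
        """Sender places a datum on the bus and raises Req; True means Ack."""
        if len(self.queue) >= self.depth:
            return False          # no Ack: the receiving queue is full
        self.queue.append(datum)  # datum latched into the queue
        return True               # Ack

    def take(self):
        """Receiver removes the oldest datum from its input queue."""
        return self.queue.popleft()


conn = DataConnection(depth=2)
acks = [conn.request(x) for x in (1.0, 2.0, 3.0)]  # third transfer must wait
first = conn.take()                                # receiver consumes a datum
retry = conn.request(3.0)                          # the retried transfer succeeds
```

Because the sender only proceeds on an acknowledgement, neither side needs a shared clock, which is the property the asynchronous connections rely on.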

The twelve types of components are divided into six categories
as indicated below:

Instruction

Bus

Memory

Processor
    Elementary: one input, one output
    Two-input: two inputs, one output
    Comparator: two inputs, one output plus special outputs

Router
    Join: multiple inputs, one output
    Fork: one input, multiple outputs
    Link: multiple inputs, multiple outputs

Controller
    RAM: internal to MCAP, no partitions
    Single-access: internal to MCAP, has partitions
    Dual-access: connects to main memory, has partitions
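
As a quick cross-check of the counts, the taxonomy above can be written down as a small Python table. The dictionary name and the string labels are illustrative choices; the single-letter codes are the abbreviations the text introduces later for the processor, router, and controller types.

```python
# Category -> component types, transcribed from the list above.
COMPONENT_CATEGORIES = {
    "Instruction": ["Instruction"],
    "Bus": ["Bus"],
    "Memory": ["Memory"],
    "Processor": ["Elementary (E)", "Two-input (T)", "Comparator (C)"],
    "Router": ["Join (J)", "Fork (F)", "Link (L)"],
    "Controller": ["RAM (R)", "Single-access (S)", "Dual-access (D)"],
}

num_categories = len(COMPONENT_CATEGORIES)
num_types = sum(len(types) for types in COMPONENT_CATEGORIES.values())
print(num_categories, num_types)
```

The tally gives six categories and twelve component types, in agreement with the text.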

As mentioned earlier, an MCAP contains one memory
component for storing instructions, one instruction component
for executing internal instructions and forming external
instructions, and one bus component for distributing
the instructions. An MCAP may contain several controller,
router, and processor components and several other
memory components for storing data. However, the other
memory components can be connected to controller components
only. Only controller components are capable of
being programmed to prefetch data from and deposit data
into data memory components. Although the instruction
memory component or a dual-access component can be
connected to the host system, all other components can be
connected to the MCAP's components only.

Each non-memory component that is used during the execution
of an algorithm contains an instruction input queue,
one or more data input queues, and control logic that includes
a number of registers. For example, a typical elementary
component is shown in Fig. 2. The instructions
for an algorithm received by a component fill these registers
and then the register contents dictate the activity
within the component while the algorithm is executed.
They determine the component's mode and, for a routing
component, the patterns for accepting inputs and distributing
outputs. For a controller component, they determine
the memory partitions, DMA accessing patterns,
and patterns for prefetching the operands needed by the
algorithm.


Fig. 2. Block diagram of an elementary component


Each of the components that receives instructions contains
a Number of Operands Output (NumOpsOut) register that
is always the last register filled before the component begins
its part in the execution of the algorithm. Each time
the component outputs an operand, the NumOpsOut register
is decremented. When the NumOpsOut register becomes
zero, the component has completed its part in executing
the current algorithm. It may then distribute new
values, those needed for the next algorithm, from its
instruction input queue to its registers. This cycle may continue
indefinitely. Except for reacting to the handshaking
(i.e., Req and Ack) signals in its connections, each component
acts independently. The data is input to a data queue
through an input connection, processed or routed through
a bus, and output through an output connection. Because
separate queues are used to input instructions and data,
the instruction and data streams are completely separate.
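
The register-loading and countdown cycle described above can be illustrated with a small Python model of an elementary component. Everything concrete here is an assumption made for illustration: the instruction encoding as `(register_name, value)` pairs, the two mode names, and the `step` method standing in for one output opportunity.

```python
from collections import deque


class ElementaryComponent:
    """Toy model of an elementary (E) component's register cycle.

    Assumptions: an instruction is a (register_name, value) pair, and the
    mode register selects a unary function from MODES.
    """

    MODES = {"negate": lambda x: -x, "reciprocal": lambda x: 1.0 / x}

    def __init__(self):
        self.instr_queue = deque()  # instruction input queue
        self.data_queue = deque()   # data input queue
        self.mode = None            # mode register
        self.num_ops_out = 0        # NumOpsOut register

    def load_registers(self):
        """Fill registers from the instruction queue; NumOpsOut comes last."""
        while self.instr_queue and self.num_ops_out == 0:
            name, value = self.instr_queue.popleft()
            if name == "mode":
                self.mode = value
            elif name == "NumOpsOut":
                self.num_ops_out = value

    def step(self):
        """Output one operand if possible; return it, or None when idle."""
        if self.num_ops_out == 0:
            self.load_registers()   # previous algorithm finished: reload
        if self.num_ops_out == 0 or not self.data_queue:
            return None
        result = self.MODES[self.mode](self.data_queue.popleft())
        self.num_ops_out -= 1       # decremented on every output
        return result


comp = ElementaryComponent()
# Instructions for two consecutive algorithms are queued ahead of time.
comp.instr_queue.extend([("mode", "negate"), ("NumOpsOut", 2),
                         ("mode", "reciprocal"), ("NumOpsOut", 1)])
comp.data_queue.extend([1.0, 2.0, 4.0])
outputs = [comp.step() for _ in range(4)]
```

In this sketch the instructions for the second algorithm wait in the queue while the first is running, and are loaded only once NumOpsOut has counted down to zero.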

The processor components are used for performing unary
and binary arithmetic/logic operations. There are three
types of processor components: one-input elementary
(E) components, two-input (T) components, and
comparator (C) components. These components contain
only two registers, a mode register and a NumOpsOut register.
The mode register dictates the actions taken by the
component and the NumOpsOut register gives the total
number of operands that are to be output before the current
algorithm is completed. Both the E and T components
may be used for either unary or binary operations,
depending on the mode. When an E component is used for
a binary operation it must, of course, input both operands
through its single input connection. A T component
performing a unary operation would use only one of its two
input connections.

A C component is similar to a T component, but has two
special sets of lines connecting it to the instruction component.
There can be only one C component in an MCAP. As
usual, its current function is determined by its mode. One
of its functions is to simply compare two inputs and set
relational flags that are then transmitted to the instruction
component over one set of the special lines. When performing
comparisons, there are no outputs other than the
flag outputs. The C component can, however, also determine
the maximum or minimum of a sequence of numbers.
In this case, the second set of special lines is used to output
the index of the maximum or minimum to the instruction
component. The maximum or minimum is output on the
output data connection.

Routing components are for directing data along the
proper paths. There are three types of routing components:
join (J) components with more than one input and
one output, fork (F) components with one input and more
than one output, and link (L) components with more than
one input and more than one output. In addition to the
mode and NumOpsOut registers, they contain registers for
dictating their input and output patterns while the current
algorithm is being executed. F and L components
may include broadcasting in their output patterns. J and
F components may be used in conjunction with T and E
components to form pipelines with feedback that can
accumulate sums.

There are three types of controller components: RAM (R)
components, single-access (S) components, and dual-access
(D) components. All controller components are for automatically
retrieving operands from and storing results in
their associated memory components. In addition, a D
component includes connections communicating with the
host's main memory. All controller components have an
output data connection for outputting operands to the
remainder of the MCAP and an input data connection for
inputting results from the MCAP. Therefore, they must
be capable of handling both an output data stream and
an input data stream. A queue is inserted in each of these
data streams. A D controller also has input and output
data streams for transferring data to and from the host's
main memory.

A significant difference between the controller components
and the other programmable components is that a Number
of Operands In (NumOpsIn) register as well as a NumOpsOut
register must be included. The NumOpsIn register
serves the same purpose for the input data stream as
NumOpsOut does for the output stream. An S component
differs from an R component in that its memory may be
divided into partitions that consist of blocks of memory
having consecutive addresses. The memory components
are interleaved so that the partitions, because they occupy
consecutive addresses, are spread across the components.
In addition to the mode, NumOpsOut and NumOpsIn registers,
an S component contains registers for specifying the
patterns for accessing the partitions and a set of registers
for each partition for specifying the pattern of accesses
within the partition.

That portion of a D component that communicates with
the MCAP is similar to an S component except that a
NumOpsOut register is needed for each output stream and a
NumOpsIn register is needed for each input stream. Both
S and D components include a window, which is a set of
memory locations with consecutive addresses whose base
address increments after each repetition of a pattern. The
purpose of the window is to separate the input from the
output. Data that are output from the partition must involve
accesses that are within the window and inputs to
the partition must involve accesses that are outside the
window. Because a partition is treated as a circular memory,
the location with the highest address in the partition
is considered to be adjacent to the one with the lowest
address and the window is considered to move in a circle.
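
The circular-window addressing described above can be sketched with modular arithmetic. The function names, the choice of a partition-relative base offset, and the `step` by which the base advances are assumptions made for illustration; the text only states that the base address increments after each repetition of a pattern.

```python
def window_addresses(base, size, partition_start, partition_len):
    """Absolute addresses covered by a window in a circular partition.

    Assumption: `base` is an offset relative to the start of the partition.
    """
    return [partition_start + (base + k) % partition_len for k in range(size)]


def advance(base, step, partition_len):
    """Move the window base after one repetition of the access pattern."""
    return (base + step) % partition_len


def may_output(addr, base, size, partition_start, partition_len):
    """True if addr lies inside the window (outputs); False means outside (inputs)."""
    offset = (addr - partition_start - base) % partition_len
    return offset < size


# An 8-word partition starting at address 100, with a 3-word window at offset 6.
addrs = window_addresses(base=6, size=3, partition_start=100, partition_len=8)
next_base = advance(base=6, step=3, partition_len=8)
inside = may_output(100, 6, 3, 100, 8)
outside = may_output(101, 6, 3, 100, 8)
```

Here the window wraps from the highest address of the partition back to the lowest, so address 100 is inside the window while address 101 is outside it and therefore available for input.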


Fig.  3.  An  example  MCAP  architecture 

An example architecture is given in Fig. 3. Its processing
subsection includes a comparator (C component), a negator
(E component), a reciprocator (E component), a set of
four pipelined adders capable of accumulation, and a set of
four pipelined multipliers. Each adder or multiplier is
constructed of four stages (a T component followed by three
E components). All communications to and from the processing
components are through six L components, three
on each side of the processor. J and F components are
provided to allow flexible use of the L components. Also,
to allow for accumulation there is a feedback connection
between the F component at the output from each adder
and the J component at the input to the adder. There
is a D component to provide intermediate memory and a
connection to main memory. The S component provides
internal storage.

3 Matching Algorithms to Architectures

In order to efficiently use the available logic and interconnections,
an architecture must be carefully matched to an
algorithm or set of algorithms. This involves a study relating
the flows, storage and processing of the data required
by the algorithm(s). Clearly, there is no point in increasing
the speed of a processing subsystem if the current interconnections
and memory hierarchy are inadequate to support
the processing (or vice versa). But a good balance for one
algorithm may not be a good balance for a different algorithm.
What is needed is a satisfactory tradeoff for the
work mix expected of a system and a means of evaluating
the design parameters chosen.

Space allows only a single example, so let us consider the
computation that most frequently occurs in computationally
intense algorithms, matrix multiplication. Let us examine
how the MCAP in Fig. 3 could be analyzed relative
to the algorithm AB = C using the middle product
method [3], where A, B and C are n x n matrices. Fig. 4
shows the required flow of data through the MCAP. The
variable m is the number of rows that can be simultaneously
stored in each of the D and S component memories.
The expressions give the total numbers of operands
transferred between the major subsystems.

The algorithm consists of the computations

$$C_i = \sum_{j=1}^{n} a_{ij} B_j, \qquad i = 1, \ldots, n,$$

where the $a_{ij}$ are the elements of A, the $B_j$ are the rows
of B, and the $C_i$ are the rows of C. The algorithm proceeds
by storing the first m elements of the first column of
A and the first m rows of B in the D component's memory.
Then the products $a_{i1} B_1$, for i = 1, ..., m, are formed and
stored in the S component. Next, the first m elements of
the second column of A are brought into the D component
and the products $a_{i2} B_2$ are formed and added to the
corresponding previous products, with the results being
returned to the S component. This is repeated n/m - 1 times,
but the last time the product totals, which are the first m
rows of C, are put in the D component and then output to
main memory. The entire process is repeated n/m times.
Overlapping can be used to reduce the required time.
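
A minimal Python sketch of the row-oriented (scalar-times-row) accumulation described above is given below. It is not the MCAP schedule itself: the function name is hypothetical, m is assumed to divide n, and the staging of operands in the D and S component memories is represented only by the `partial` list that holds m running row sums.

```python
def middle_product_matmul(A, B, m):
    """Compute C = A B in blocks of m rows using scalar-times-row updates.

    A and B are n x n lists of lists. Assumption: m divides n.
    """
    n = len(A)
    C = []
    for block in range(0, n, m):                 # n/m passes over row blocks
        partial = [[0.0] * n for _ in range(m)]  # running sums (the "S" store)
        for j in range(n):                       # column j of A, row j of B
            row_b = B[j]
            for i in range(m):
                a_ij = A[block + i][j]
                # add the scalar-times-row product a_ij * B_j to row i
                partial[i] = [p + a_ij * b for p, b in zip(partial[i], row_b)]
        C.extend(partial)                        # finished rows of C leave the block
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_one = middle_product_matmul(A, B, m=1)
C_two = middle_product_matmul(A, B, m=2)
```

Each inner update works on a whole row of B at once, which is what lets the operands stream through pipelined multipliers and accumulating adders rather than being fetched element by element.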


By matching this flow with the architecture in Fig. 3,
it is seen that each adder and multiplier must perform
approximately $n^3/4$ operations and each link on the left
and two of the links on the right must perform approximately
$n^3$ transfers. (The third link on the right is not
needed.) The approximate numbers of accesses to the
S component, D component and main memory are
$2n^3$, $n^3(1 + 1/m)$ and $n^3/m$, respectively. If T is the
per-stage processing time of the multipliers, then T should
also be the per-stage processing time of the adders and
T/4 should be the transfer time of the links. The access
times of the S component, D component and main memory
should be T/8, mT/(4(m + 1)) and mT/4, respectively, for
both reads and writes. For T = 40 ns and m = 8, the link
transfer time should be 10 ns and the average memory access
times should be 5 ns, 9 ns and 80 ns. The computation
rate would be 200 Mflops. If the MCAP were
put into an MCM or wafer and memory interleaving were
used, these times would certainly be within the capability
of current HCMOS technology. (The join and fork components
were ignored in this discussion because the communication
times are dictated by the slower link components.)
BiCMOS and GaAs could produce proportionately faster
processing, memory and memory controller components,
but, as seen in the next section, increasing the speed of
the link components is a more challenging problem.
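
The balance argument above can be checked numerically. The sketch below assumes that the four adders and four multipliers of Fig. 3 share the work evenly, so that each performs about one quarter of the $n^3$ operations of its kind, and it requires every subsystem to be busy for the same total time. The function name and the normalization by $n^3$ are illustrative choices.

```python
def balanced_times(T_ns, m, units_per_type=4):
    """Return per-transfer times (ns) that keep all subsystems equally busy.

    All operation and access counts are expressed per n**3, so only the
    ratios between subsystems matter.
    """
    busy = T_ns / units_per_type      # busy time per n^3 for each adder/multiplier
    counts = {
        "link": 1.0,                  # n^3 transfers per active link
        "S": 2.0,                     # 2 n^3 accesses
        "D": 1.0 + 1.0 / m,           # n^3 (1 + 1/m) accesses
        "main": 1.0 / m,              # n^3 / m accesses
    }
    times = {name: busy / count for name, count in counts.items()}
    # One result per stage time T from each adder and each multiplier.
    mflops = 2 * units_per_type / T_ns * 1e3
    return times, mflops


times, mflops = balanced_times(T_ns=40.0, m=8)
```

With T = 40 ns and m = 8 this gives a 10 ns link transfer time, access times of 5 ns, about 8.9 ns and 80 ns for the S component, D component and main memory, and a rate of 200 Mflops, in line with the figures quoted in the text.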


Fig. 4. Data flow for matrix multiplication

Except for the unused link component, the design would
utilize the link and processor components over 95% of the
time while performing a matrix multiplication. In contrast,
note that matrix addition would utilize these components
only about 50% of the time on the average, with the S,
multiplier and some of the routing components not being
used at all. This contrast points out the need for different
designs for different algorithms and the need for compromise
when a set of algorithms must be executed on the
same architecture.


4 MCM Implementation Considerations

Since the signal delays associated with a PCB implementation
are expected to be prohibitively large, it is
thought that fabricating an MCAP in a Multi-Chip
Module (MCM) configuration or with Wafer Scale Integration
(WSI) are the only realistic alternatives for attaining high
performance. Some important design considerations for
implementing an example MCAP architecture in an MCM
configuration are presented in this section.

Fig. 5 shows a layout for an MCM implementation of the
example architecture. In designing this layout, we aimed
toward minimizing chip-to-chip interconnections, maximizing
interconnection densities, and using a parallel architecture.
Other factors of importance are ground and power plane
generation and physical design verification. The amount of
heat generated is directly dependent on the type of substrate
(MCMs are classified according to the substrate technology:
MCM-C, MCM-D, and MCM-L), the selection of bonding and
the placement of chips. Parasitics on the interconnects,
inductances on the power lines and the I/O pin limitation
are other important considerations.
Fig. 5. Layout of MCAP on an MCM


4.1  The  transistor  count 

In estimating the total number of transistors required
to build the proposed MCAP, we made the assumption
that the technology used is high-speed CMOS. CMOS was
picked as the first benchmark technology because of its
commercial maturity. In the future, faster technologies
such as GaAs will be evaluated. As an example, let us
consider a pipelined 64-bit floating point adder with four
stages. It has: (1) nine 64-bit registers with 4032 transistors
(7 transistors per bit for a dynamic latch), (2) seventy-four
3-input XOR gates with 592 transistors, (3) one hundred
and twenty-six 2-to-1 MUXes with 504 transistors,
(4) two 11-bit adders with 528 transistors, (5) one 52-bit
adder with 1248 transistors, (6) a 64-bit leading zero detector
with 5000 transistors, (7) two 52-bit barrel shifters
with 4000 transistors, and (8) rounding and other control
logic taking 6500 transistors.

The total is 23K transistors for an adder. By having four
pipelined stages, we can achieve stage delays of less than
20 ns [8]. This delay is of course expected to be even
smaller for faster technologies like GaAs. Similarly, we
can evaluate the number of transistors for a pipelined 64-bit
floating point multiplier (using an optimized, modified
Booth's algorithm) and arrive at a total of 58K transistors.
Again with four pipelined stages, the delay per stage
is less than 20 ns [8]. Following this procedure, the transistor
counts for the rest of the elements in the MCAP are
calculated, and Table 1 gives the count for the various
components. Approximately ten million transistors are
estimated to be needed to build the whole MCAP.
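
The adder tally above can be checked with a few lines of Python. This is only a bookkeeping sketch: the dictionary keys and the function name are illustrative, and the per-block figures are copied from the list in the text. The itemized figures sum to 22,404, which is close to the quoted total of 23K.

```python
# Per-block transistor estimates for the four-stage 64-bit floating
# point adder, copied from the itemized list in the text.
ADDER_PARTS = {
    "nine 64-bit registers": 9 * 64 * 7,   # 7 transistors per bit
    "74 three-input XOR gates": 592,
    "126 two-to-one MUXes": 504,
    "two 11-bit adders": 528,
    "one 52-bit adder": 1248,
    "64-bit leading zero detector": 5000,
    "two 52-bit barrel shifters": 4000,
    "rounding and control logic": 6500,
}


def adder_transistors(parts=ADDER_PARTS):
    """Sum the per-block transistor estimates."""
    return sum(parts.values())


if __name__ == "__main__":
    total = adder_transistors()
    print(f"adder total: {total} transistors (about {total / 1000:.1f}K)")
```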

In the proposed architecture, the bottleneck is the communication
through the LINK elements because of their
high fan-out and relatively large interconnection distances.
This means that the output buffers for these elements must
be relatively large. Next, we present the delay, power
and area calculations for the output buffers as functions
of fan-out ($F$) and interconnection length ($l$). For these
calculations: (1) the input capacitance of a gate, including
the lead and ESD capacitance, is $C_{in} = 1$ pF; (2) the
width of the metal conductor used for an interconnection
is $w = 25\ \mu$m; (3) the capacitance of the metal conductor
is $C_{ml} = 30$ aF/$\mu$m$^2$; (4) the sheet resistance of the metal
is $R_s = 0.05\ \Omega/\square$; (5) the feature size is $\lambda = 0.5\ \mu$m.

4.2  Load  capacitance 

The load capacitance is $C_L = C_{int} + F \times C_{in}$, with
$C_{int} = w \times l \times C_{ml} = (0.025) \times l \times 30 \times 10^{-6} \times 10^{6}$ pF $= 0.75\,l$ pF,
where $l$ is in mm and $C_{in} = 1$ pF. The resistance of the
interconnect is

$R_{int} = R_s \times (l/w) = 0.05 \times (l/0.025) = 2\,l\ \Omega$    (1)

Thus, a LINK element with a fan-out of 19 and an average
interconnection length of 2 cm has a load capacitance
of 34 pF.
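
As a quick numerical check of the load capacitance and Eq. (1), the sketch below evaluates both for the worked example. The constant and function names are illustrative, and lengths are taken in mm as in the text, so that 30 aF per square micrometre becomes 30 pF per square millimetre.

```python
C_IN_PF = 1.0            # input capacitance per driven gate, pF
WIDTH_MM = 0.025         # conductor width: 25 um expressed in mm
C_ML_PF_PER_MM2 = 30.0   # 30 aF/um^2 is the same as 30 pF/mm^2
R_SHEET_OHM = 0.05       # sheet resistance, ohms per square


def interconnect_capacitance_pf(length_mm):
    """C_int = w * l * C_ml, which reduces to 0.75 * l pF."""
    return WIDTH_MM * length_mm * C_ML_PF_PER_MM2


def load_capacitance_pf(fan_out, length_mm):
    """C_L = C_int + F * C_in."""
    return interconnect_capacitance_pf(length_mm) + fan_out * C_IN_PF


def interconnect_resistance_ohm(length_mm):
    """Eq. (1): R_int = R_s * (l / w), which reduces to 2 * l ohms."""
    return R_SHEET_OHM * (length_mm / WIDTH_MM)


if __name__ == "__main__":
    print(load_capacitance_pf(19, 20.0))        # about 34 pF
    print(interconnect_resistance_ohm(20.0))    # about 40 ohms
```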

4.3  Average  delay 

It is known that, in general, the minimum size of a logic
gate has a W/L ratio of 2. So, we start with a ratio of
2 and go to higher values in stages in order to drive a
load within a short time. By dividing the buffer into
a number of stages with increasing W/L, optimum
speeds can be achieved. It has been found that a stage
ratio of 3 [9] gives the best results. Also, the optimum number
of stages is $N = 0.91(\ln C_L + 4.19)$, where $N$ is truncated to
the nearest integer. Using the optimum number of stages,
the average delay is

$T_{avg} = 0.484(N - 1) + 5C_L/3^{(N-1)} + 0.076$ ns    (2)

The plot of $T_{avg}$ as a function of $F$ and $l$ is shown in Fig. 6.
For the example with $F = 19$ and $l = 20$ mm, the delay
time is seen to be 3.2 ns.
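
The stage count and Eq. (2) can be evaluated directly. The sketch below assumes that the load capacitance is expressed in pF and that the second term of Eq. (2) is divided by 3 raised to the power N - 1; both readings are inferred from the worked example, which they reproduce (N = 7 and about 3.2 ns for 34 pF). The function names are illustrative.

```python
import math


def optimum_stages(c_load_pf):
    """N = 0.91 * (ln C_L + 4.19), truncated to an integer (C_L in pF)."""
    return int(0.91 * (math.log(c_load_pf) + 4.19))


def average_delay_ns(c_load_pf):
    """Eq. (2): T_avg = 0.484 (N - 1) + 5 C_L / 3^(N - 1) + 0.076 ns."""
    n = optimum_stages(c_load_pf)
    return 0.484 * (n - 1) + 5.0 * c_load_pf / 3 ** (n - 1) + 0.076


if __name__ == "__main__":
    c_load = 34.0  # pF, the LINK example with F = 19 and l = 20 mm
    print(optimum_stages(c_load), round(average_delay_ns(c_load), 2))
```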


Fig.  6.  Buffer  and  interconnect  delay 

4.4  Buffer  area 

A simple inverter with $(W/L)_n = (W/L)_p = 2$ will need
an area of 66 $\mu$m$^2$. A buffer with equal rise ($t_r$) and fall
($t_f$) times requires $(W/L)_p = 2(W/L)_n = 4$, and its area is
150 $\mu$m$^2$. The total area of the buffer depends
on the number of stages and, hence, is a function of $F$ and
$l$. We have

Area $= 66 + 3[14(N - 1) + 36(1 + 3 + 3^2 + \cdots + 3^{N-2})] \approx 55 \times 3^{N-1}\ \mu$m$^2$
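
A short sketch comparing the full area expression with its approximation is given below. It assumes the geometric series runs up to 3 raised to the power N - 2, which is the reading that makes the sum agree with the stated approximation; the function names are illustrative. For N = 7 the two forms give roughly 39,600 and 40,100 square micrometres.

```python
def buffer_area_um2(n_stages):
    """Area = 66 + 3 [14 (N - 1) + 36 (1 + 3 + ... + 3^(N - 2))] um^2."""
    geometric = sum(3 ** k for k in range(n_stages - 1))
    return 66 + 3 * (14 * (n_stages - 1) + 36 * geometric)


def buffer_area_approx_um2(n_stages):
    """Approximation: 55 * 3^(N - 1) um^2."""
    return 55 * 3 ** (n_stages - 1)


if __name__ == "__main__":
    print(buffer_area_um2(7), buffer_area_approx_um2(7))
```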

4.5  Power  dissipation  in  the  buffer 

In CMOS, most of the power is dissipated during switching
and, hence, dynamic power is approximately equal to the
total power. The dynamic power is $P_d = C_T \times V^2 \times f =
25(C_L + C_{buff})/T_{avg}$, where $C_{buff} = 0.0152(3^{N-1})$ pF.

Since the design of an MCAP uses asynchronous communication,
a transfer over a LINK component involves the
return of an acknowledge signal and the transmission of an
output enable signal. It is estimated that the transfer rate
may be as high as $f = 1/(2T_{avg})$ Hz. For $F = 19$ and $l = 20$
mm, the total power dissipated by the buffer is 175 mW.
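
The quoted power figure can be approximated as follows. This sketch assumes a 5 V supply (inferred from the factor of 25 in the power expression) and the halved transfer rate of one transfer per two average delays; function and parameter names are illustrative. With 34 pF, seven stages and a 3.2 ns delay it gives about 176 mW, close to the 175 mW in the text.

```python
def buffer_capacitance_pf(n_stages):
    """C_buff = 0.0152 * 3^(N - 1) pF."""
    return 0.0152 * 3 ** (n_stages - 1)


def buffer_power_mw(c_load_pf, n_stages, t_avg_ns, supply_v=5.0):
    """P_d = C_T * V^2 * f with f = 1 / (2 * T_avg).

    supply_v = 5 V is an assumption inferred from the factor 25.
    Units: pF * V^2 * GHz = 1e-12 * 1e9 W = 1e-3 W, so the result is in mW.
    """
    c_total_pf = c_load_pf + buffer_capacitance_pf(n_stages)
    freq_ghz = 1.0 / (2.0 * t_avg_ns)
    return c_total_pf * supply_v ** 2 * freq_ghz


if __name__ == "__main__":
    print(round(buffer_power_mw(34.0, 7, 3.2), 1))
```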


4.6  Thermal  management 

There have been successive revolutions in device technologies,
proceeding from TTL, ECL and NMOS to the recent
high-speed CMOS, BiCMOS and GaAs. These have brought a
three to five orders of magnitude reduction in minimal feature
size, an order of magnitude in the characteristic chip
dimension and, more importantly, a significant drop in the
transistor switching energy from more than 10~* J to nearly
10-’* J [10]. Power dissipation in a leading-edge bipolar
chip with 1 cm$^2$ area has reached 20 - 25 W and, based on
a short-term extrapolation of current trends in packaging
technology, it may well be anticipated that the power
dissipation might reach more than 100 W for 50 million
transistors on the same 1 cm$^2$ area with a switching
speed of 10 ps [10]. After comparing various existing VLSI
modules in terms of thermal parameters [10], a heat flux of
$Q = 25$ W/cm$^2$ seems to be a reasonable value for air
cooling. Considering again the critical LINK element in
the MCAP, we estimate $Q = 14.24$ W/cm$^2$ to drive 2 cm
of interconnect and 19 gates. It is reasonable to expect,
therefore, that for an MCAP architecture implemented in
MCM, air cooling would be sufficient.

5  Conclusions 

The  architecture  to  implement  a  class  of  high-performance 
attached  processors,  which  can  be  modularly  configured 
to  match  given  sets  of  algorithms,  has  been  presented. 
The  high  utilization  rate  of  the  processing  components  is 
achieved  mainly  by  (1)  minimizing  the  movement  of  in¬ 
termediate  results;  (2)  prefetching  almost  all  operands  us¬ 
ing  intelligent  memory  controllers;  and  (3)  reconfiguring 
(through  programming)  the  interconnection  of  the  process¬ 
ing  components  to  match  the  needs  of  a  given  algorithm. 

An  example  MCAP  architecture  was  evaluated  for  MCM 
implementation.  Because  of  its  commercial  maturity,  the 
CMOS  technology  was  picked  as  the  first  (benchmark) 
technology  to  be  evaluated.  Transistor  count  for  imple¬ 
menting  the  MCAP  was  estimated  at  9.85  million.  In 
the  proposed  architecture,  the  bottleneck  is  the  commu¬ 
nication  through  the  LINK  elements  because  of  their  high 
fan-out  and  relatively  large  interconnection  distances.  For 
the  LINK  element  output  buffers,  delay,  power  and  area 
calculations  were  made  as  functions  of  fan-out  and  inter¬ 
connection  length.  For  example,  a  LINK  element  with  a 
fan-out  of  19  and  an  average  interconnection  length  of  2 
cm  has  a  load  capacitance  of  34  pF,  has  a  delay  time  of 
3.2 ns, occupies an area of 40,000 $\mu$m$^2$ and dissipates 175
mW of power. Heat flux was estimated at 14.2 W/cm$^2$,
which  leads  us  to  believe  that  air  cooling  will  be  sufficient 
for  this  MCAP  architecture  implemented  in  MCM. 

Further improvements in MCAP performance could be obtained
by: (1) reducing the minimal feature size to 0.5
$\mu$m or 0.2 $\mu$m, (2) minimizing the chip-to-chip spacing by
mounting the chips on two sides, (3) employing a higher
speed technology like GaAs (HEMTs), and (4) perhaps using
Wafer Scale Integration.
Abstract

A new architecture for high-performance parallel attached
processors is studied in this paper. The unique features are
that the attached processor can be configured to match a
set of algorithms and its memory controllers can be programmed
to fit the access patterns required by the algorithms.
As a result, high utilization of the processing logic
for given sets of algorithms can be obtained. A simulator
with an interactive graphic interface is designed to study
the performance of the proposed architecture. An example
based on matrix multiplication is used for illustration. The
simulation results show that a sustained execution rate as
high as 95% of the peak speed for matrices with a size of
128 x 128 can be achieved in the proposed attached processor
architecture. If CMOS technology is chosen to implement
the MCAP architecture, a sustained speed of 190 MFLOPS
can be obtained for matrix multiplication with four multipliers
and four adders.¹

1  Introduction 

An attached, or back-end, processor is a processing system
that is connected to a host computer for the purpose
of very quickly executing most of the overall system's
computational tasks. In such an organization, "the host is a
program manager which handles all I/O, code compiling,
and operating system functions, while the back-end attached
processor concentrates on arithmetic computation
with data supplied by the host machine" [1].

The specific purpose of an attached processor is to execute
members of a set of algorithms very quickly. The
broader the set of algorithms, the more generally applicable
the attached processor. The underlying goal of the
designer is to efficiently utilize the hardware for as broad
a set of algorithms as possible. However, for most current
designs, the average sustainable execution rates have been
found to be at most 20% of their peak rates, which
are determined by summing the maximum computational
rates of the processing elements. For example, the sustainable
rate for a Cray X-MP with four processors may
be as low as 5% for some algorithms [2]. Also, extensive
evaluations of recent high-performance computations using
Lapack are given in [3], and evaluations using NSA parallel
benchmarks are given in [4]. These evaluations confirm the low
efficiencies of most supercomputers. Although some of the
lost efficiency is necessitated by the algorithms, much of
it is due to memory accessing and contention for shared
resources in general, including internal buses.

¹The work reported in this paper was supported in part by
the Office of Naval Research under Grant No. N00014-93-1-1343.
Any opinions, findings, and conclusions or recommendations
expressed in this paper are those of the authors and do
not necessarily reflect the view of the funding agency.

Described in this paper is a class of high-performance attached
processors called Modularly Configurable Attached
Processors (MCAPs), which can attain high speed and high
utilization through:

• Closely matching their architectures to the set of algorithms
they are to execute.

• Overlapping of processing and memory accessing by
using memory prefetching.

• Minimizing the movement of data.

• Using a high-speed technology with MCM or wafer
scale implementations.


An MCAP is constructed from the component types specified
in Sec. 2. These component types are such that
each member of the class may include parallel processing
and memory-to-memory pipelines, and be constructed in
a building block fashion. They encompass routing components
(including buses) as well as memory, control, and
processing components. By overlapping processing with
memory accessing and matching an architecture with a set
of algorithms, it is predicted that the average sustainable
rate for a specific set of algorithms can attain at least 60%
of the peak rate. By defining components that are simple
enough to be fabricated onto single low-density ICs, a
high-speed technology may be used.

In order to study and analyze the performance of the
MCAP architecture, a set of simulation tools has been developed.
These tools include an architecture editor, an
assembler, and a simulator. The main objective of this paper
is to study the performance of the MCAP architecture
using the developed simulation tools. Matrix multiplication
is studied as an example to illustrate how to match an
algorithm to a specific MCAP architecture. Through the
simulation, the hardware attributes of each component,
such as the execution delay and the size of the data queue,
can be fine tuned to achieve a high sustained execution
rate. For example, the simulation results showed that for
matrices with a size of 128 x 128, a sustained rate as high
as 95% of the peak speed can be achieved.

The  rest  of  this  paper  is  organized  as  follows.  Section  2 
briefly  describes  the  architecture  of  the  MCAP  and  the 
fundamental  components  required  to  construct  an  MCAP. 
Section  3  describes  the  simulation  tools  developed  for  the 
performance  analysis  of  the  MCAP  architectures.  Section 
4  shows  how  to  design  a  simulation  program  to  match 
an  algorithm  on  a  given  MCAP  architecture.  Section  5 
shows the simulation results and discusses the various design
principles for achieving high sustained execution
speeds on MCAP architectures. This paper concludes with
Section 6.

2  MCAP  Architecture 

An MCAP is an attached processor that is constructed entirely
from a standard set of connections and components.
This standard set consists of three types of asynchronous
connections and twelve types of components. The definitions
of the connection and component types provide
a standard set of rules that allow the components to be
easily configured in different ways to construct attached
processors that can efficiently perform different sets of algorithms.

An MCAP operates by drawing an instruction stream from
the memory component into the instruction component.
The instruction component uses internal instructions in
the stream to form external instructions that are then distributed
to the other non-memory components through the
MCAP's bus component. All components in the instruction
stream include input instruction queues. When the
non-memory components have received all of the instructions
needed to perform an algorithm, they automatically
prefetch the data from the memory components, route
the data to and from the processor components and store
the results back into the memory components. All non-memory
components have input data queues. DMA units
built into some controller components, which are the components
that supervise all memory accessing, are used to
automatically transfer data between the host's main memory
and the MCAP's memory components while the algorithm
is executing. Also, the instruction and data streams
are separate, thereby allowing the instructions needed for
the next algorithm to be distributed while the current algorithm
is executing.

An example architecture is given in Fig. 1. Its processing
subsection includes a comparator, a negator (elementary
component), a reciprocator (elementary), a set of pipelined
adders capable of accumulation, and a set of pipelined multipliers.
Each adder or multiplier is constructed as a four-stage
pipeline [a two-input (T) component followed by
three elementary (E) components]. All communications to
and from the processing components are through six link
components, three on each side of the processor. Join and
fork components are provided to allow flexible use of the
link components. Also, to allow for accumulation, there
is a feedback connection between the fork component at
the output from each adder and the join component at the
input to the adder. There is a dual-access component to
provide intermediate memory and a connection to main
memory. The single-access component provides internal
storage. A detailed description of these basic components
can be found in [5].




Fig.  1.  An  example  MCAP  architecture. 

3  Simulation  tools  for  the  performance 
analysis  of  the  MCAP  architectures 

Three CAD tools have been developed for the performance
analysis of the MCAP architectures. These tools include
an architecture editor, an assembler, and a simulator. All
these tools are written in C++ and installed on a PC compatible
with a 486 microprocessor.

3.1  Architecture  editor 

An interactive graphics editor is designed to facilitate the
construction of an MCAP architecture. The architecture
editor provides the following functions for constructing an
MCAP architecture: (1) creating, deleting, and moving
around any fundamental component defined in Section 2
in an existing architecture, (2) adding or deleting a connection
between any two components, (3) modifying the
attributes of any component, such as the execution time,
instruction or data queue size, number of pipeline stages,
capacity, etc. All the above functions are performed on an
interactive graphic display, so it is easy and convenient
to construct any kind of MCAP architecture. The
output file generated by the architecture editor is the architecture
source file, which is later used by the assembler
and the simulator. The architecture source file specifies
the detailed information of an MCAP architecture, such
as the component ID, the number of connections and their
connection numbers, and other attributes of each component.


Fig.  2.  The  simulation  tools  developed  for  simulating  the 
MCAP  architecture. 

3.2  Generation  of  a  load  file  for  the 
graphic  simulator 

The assembler compiles a simulation program and an architecture
source file, then generates a load file to be used
by the simulator. The assembler runs two passes. In the
first pass, the assembler reads the architecture source file
and builds two tables. The first table associates each component's
mnemonic with its corresponding ID number, and
the second table lists all the connections along with their
corresponding source and destination components. During
the first pass, the assembler also includes all defined
symbols and their values in the first table. In the second
pass, the assembler reads the simulation program again
and produces the load statements for each instruction. If
the simulation program references only components listed
in the architecture file, no syntax errors are generated and
the assembler produces a load file; otherwise, the assembler
produces an error file (.SLT) which lists the line numbers
in the simulation program where the errors occur. A segment
of the load file for the link component, L80, is shown
below.

53 L80 S2 ;set the mode
57 L80 1 i 135 ;set input connection
59 L80 4 4 151 196 190 ;set the output connections
58 L80 32 5 32767 151 196 190 ;set output patterns
51 L80 33 ;set the number of output operands

Fig.  2  shows  the  process  of  creating  a  load  file  from  the 
simulation  program  and  an  architecture  source  file. 

3.3  MCAP  Simulator 

The simulator is designed to simulate the operation of each
component in an MCAP architecture while an algorithm
is being executed on the architecture. This allows an architecture
to be matched to an algorithm. The structure
of the state diagrams for the memory, instruction and bus
components is shown in Fig. 3. Note that each component
may take on a subset of the following states:

FREE - there is no activity in the component
DIST - the component is waiting for an instruction to distribute
DBSY - an instruction is being distributed to its register(s)
IDLE - the component is waiting for input
BUSY - the component is executing
WAIT - the component is waiting for its output to be taken

The  simulator  first  brings  in  the  architecture  source  file 
and  the  program  to  be  simulated,  opens  a  result  file,  asks 
the  user  for  the  format  of  the  results  and  then  begins  the 
simulation.  The  pseudocode  for  the  simulator  is 

Retrieve architecture source file
Retrieve simulation program file
Open results file and request the format of the results
Initialize variables (includes setting the system time to zero)
DO {
    Update components in BUSY state
    Update components in WAIT state
    Update instruction and data queues
    Update components in IDLE state
    Update components in FREE state
    Update components in DIST state
    Update components in DBSY state
    Increment system time
    If format requires results to be stored, then output results
} WHILE (not end of simulation)
Close results file

Inside the DO loop, the simulator first updates all components
currently in the BUSY state. Their execution times,
which are set to their maximum values when the BUSY
state is entered, are decremented and, for those that become
zero, the component's state is changed (usually to
the IDLE state, see Fig. 3). As the transition occurs, appropriate
actions are taken. Similarly, the components in
the other states are checked. If the transition conditions
are met (e.g., the output has been accepted by the succeeding
component), then appropriate actions are taken
and the component is put into its next state. The updating
is done such that a component can have at most
one state change each time around the loop. Also, all instruction
and data queues are updated each time around
the loop. The results primarily consist of the times each
component spends in each of its states and are recorded at
the times specified by the results format.
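
The loop described above can be illustrated with a much reduced Python sketch. It is not the authors' C++ simulator: it models only the IDLE and BUSY states, a single input queue per component and no connections between components, and the class, field and function names are hypothetical. It does keep two properties from the description: a component changes state at most once per pass, and the result is the time spent in each state.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    exec_time: int                 # time units spent in BUSY per operand (>= 1)
    queue: int = 0                 # operands waiting in the input queue
    state: str = "IDLE"
    remaining: int = 0             # countdown while BUSY
    time_in_state: Counter = field(default_factory=Counter)


def step(components):
    """Advance all components by one time unit, one state change at most."""
    for c in components:
        c.time_in_state[c.state] += 1
        if c.state == "BUSY":
            c.remaining -= 1
            if c.remaining == 0:
                c.state = "IDLE"           # execution finished
        elif c.state == "IDLE" and c.queue > 0:
            c.queue -= 1
            c.remaining = c.exec_time      # set to maximum on entering BUSY
            c.state = "BUSY"


def simulate(components, max_time=1000):
    """Run until every component is idle with an empty queue."""
    time = 0
    while time < max_time and any(
        c.state != "IDLE" or c.queue > 0 for c in components
    ):
        step(components)
        time += 1
    return time


if __name__ == "__main__":
    mult = Component("M1", exec_time=3, queue=2)
    print(simulate([mult]), dict(mult.time_in_state))
```

A fuller model would add the WAIT, FREE, DIST and DBSY states and would move operands between the queues of connected components on each pass.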



Fig.  3.  State  diagram  structure  for  the  (a)  memory,  (b) 
instruction,  and  (c)  bus  components. 


4  Matching  an  algorithm  to  an  MCAP 
architecture 

In order to efficiently use the available logic and interconnections,
an architecture must be carefully matched to an
algorithm or set of algorithms. This involves a study relating
the flows, storage and processing of the data required
by the algorithm(s). Clearly, there is no point in increasing
the speed of a processing subsystem if the current interconnections
and memory hierarchy are inadequate to support
the processing (or vice versa). But a good balance for one
algorithm may not be a good balance for a different algorithm.
What is needed is a satisfactory tradeoff for the
work mix expected of a system and a means of evaluating
the design parameters chosen.

Space allows only a single example, so let us consider the
computation that most frequently occurs in computationally
intense algorithms, matrix multiplication. Let us examine
how the MCAP in Fig. 1 could be analyzed relative
to the algorithm AB = C using the middle product
method [6], where A, B and C are n x n matrices. Assume
the S2's memory can hold the entire matrices for the
multiplication.

The algorithm consists of the computations

$\sum_{j=1}^{n} a_{ij} B_j = C_i, \quad i = 1, \ldots, n$

where the $a_{ij}$s are the elements of A, the $B_j$s are the rows
of B, and the $C_i$s are the rows of C. The algorithm proceeds
by storing the first n elements of the first column
of A and the first n rows of B in the S2's memory. Then
the products $a_{i1}B_1$, for $i = 1, \ldots, n$, are formed and stored
in the S1's memory. Next, the first n elements of the second
column of A are brought into the S1 and the products
$a_{i2}B_2$ are formed and added to the corresponding previous
products, with the results being returned to the S1.
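
The middle product ordering described above can be written as a short Python sketch. It shows only the loop structure (each column of A paired with the matching row of B, with partial rows of C accumulated in place); the function name is illustrative and the list `c` merely stands in for the intermediate storage.

```python
def middle_product(a, b):
    """Compute C = A B for n x n lists of lists using the middle product order."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]   # accumulated rows C_i
    for j in range(n):                  # column j of A pairs with row j of B
        row_b = b[j]
        for i in range(n):
            a_ij = a[i][j]              # scalar applied to the whole row B_j
            for k in range(n):
                c[i][k] += a_ij * row_b[k]   # C_i += a_ij * B_j
    return c


if __name__ == "__main__":
    print(middle_product([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```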

By matching this algorithm with the architecture in Fig. 1,
it is seen that each adder and multiplier must perform approximately
$n^3/2$ operations and each link on the left and
two of the links on the right must perform approximately
$n^3$ transfers. (The third link on the right is not needed.)
The approximate number of accesses to the S components
is about $2n^3$. If $T$ is the per stage processing time of the
multipliers, then $T$ should also be the per stage processing
time of the adders and $T/4$ should be the transfer time of
the links. The access times of the S components should
be $T/8$ for both reads and writes. For $T = 40$ ns, the
link transfer time should be 10 ns and the average memory
access times should be 5 ns. The computation rate would
be 200 Mflops. If the MCAP were put into an
MCM or wafer and memory interleaving were used, these
times would be within the capability of current HCMOS
technology.
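
The balance figures quoted above follow from simple arithmetic, sketched below. The sketch assumes four multipliers and four adders, each completing one operation per stage time once its pipeline is full; the function names are illustrative.

```python
def balanced_times_ns(stage_time_ns):
    """Link transfer time T/4 and memory access time T/8 for stage time T."""
    return {
        "link_transfer": stage_time_ns / 4,
        "memory_access": stage_time_ns / 8,
    }


def peak_mflops(stage_time_ns, n_multipliers=4, n_adders=4):
    """Peak rate when every pipelined unit finishes one operation per stage time."""
    ops_per_second = (n_multipliers + n_adders) / (stage_time_ns * 1e-9)
    return ops_per_second / 1e6


if __name__ == "__main__":
    print(balanced_times_ns(40.0), peak_mflops(40.0))
```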

4.1 Design of a simulation program for an
MCAP Architecture

To  verify  the  above  simple  analysis,  we  have  designed  a 
set  of  simulation  programs  for  the  matrix  multiplication 
to  be  run  by  the  MCAP  simulator.  The  instruction  set  for 
an  MCAP  architecture  consists  of  two  sets  of  instructions, 
internal  and  external.  The  former  is  processed  within  the 
instruction  component  and  the  latter  is  distributed  by  the 
instruction  component  to  the  corresponding  components. 

In this paper we will only discuss the external instruction
set. The external instruction set consists of three
types of instructions: instructions which set the number
of operands to be output from or input to a component,
instructions which set the mode of a component, and instructions
which set input or output connection patterns
for the router components or partition and operand patterns
for the controller components.

When programming an MCAP architecture, each component
must be programmed individually. The operation
mode, input/output patterns, connections, number
of operands to be input or output, broadcasting patterns,
etc., are programmed for each component using external
instructions. For example, referring to Fig. 1, the input
data stream is supplied from the S2 component. This data
stream passes through a link component and is distributed
among four multipliers (each multiplier is composed of one
T and three E components); the output of the multipliers
is then sent to the four adders for accumulating the
results. The S1 component stores the intermediate results
output from the adders and sends them back to the adders
when new products are produced by the multipliers. The final
results are stored back to the S2 component. For the matrix
multiplication discussed in the previous section, the
instruction set for programming each component
is discussed below.

4.2  Programming  the  single  access,  S, 
component 

To compute one product for one row of the C matrix, one
needs to broadcast one element of A and then transfer one
row of B to the multipliers. Since there are n products to
be multiplied and added to form a row of C, this operation
needs to be repeated n times. So, the total number
of operands to be output from the S2 component to the
multipliers is $n^2(n + 1)$. The memory access controller, i.e.
the S component, connects to one or more memory modules
and allows the storage and retrieval of data to and from
these modules. Memory as a whole is divided into a specified
number of partitions across the modules. Thus, the
consecutive addresses of a single partition extend over adjacent
modules to reduce the average memory access time.
Also, partitions allow the data to be stored in a particular
pattern according to the requirement of the algorithm.
Fig. 4 shows how the partitions are distributed across the
memory modules.


Fig.  4.  Distribution  of  partitions  across  memory  modules. 

To access a memory location, the S component alternates
between the partitions listed in the programmed partition
pattern. As an example, to store two n x n matrices A
and B, we divide the memory connected to S2 in the example
architecture into two partitions using the 'spbs' (S
partition base and size) instruction. The instruction
sequence for the S2 component is shown below:

smod S2, mode ;set mode
sopp S2, #n, 0, 1 ;set output pattern, 1 followed by n
spbs S2, 0, 0, N ;set base address for partition 0
spbs S2, 1, N, N ;set base address for partition 1
stko S2, Nx(n+1) ;set the number of output operands

The first instruction configures the S2 component for output
only and sets the output pattern of partitions 0 and 1
to a 1 then n pattern. Since the S component only outputs
data, only the output partition pattern needs to be specified.
The second instruction generates a pattern in which
partition 0 is accessed once and partition 1 is accessed n
times. This pattern is repeated N times. The third instruction
defines partition 0 to start at base address 0 and
assigns it a size of N. The fourth instruction defines partition
1 to start at base address N and assigns it a size of
N. Using the middle product method for the matrix multiplication,
the first element of the first row of matrix A
needs to be output first, followed by the first row of matrix
B. Then, the second element of the first row of matrix A
is output, followed by the second row of matrix B, and so
on.

Using the instructions shown above, S2 will output the
data stream $a_{11}, b_{11}, \ldots, b_{1n}, a_{12}, b_{21}, \ldots, b_{2n}, \ldots, a_{1n},
b_{n1}, \ldots, b_{nn}$ to produce the first row of C. Similarly, S2
will output the data stream $a_{21}, b_{11}, \ldots, b_{1n}, a_{22},
b_{21}, \ldots, b_{2n}, \ldots, a_{2n}, b_{n1}, \ldots, b_{nn}$ to produce the second
row of C, etc.
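
The output order described above can be sketched in Python as follows. This is only an illustration of the 1 then n partition pattern, not the MCAP simulator; the function name `s2_output_stream` and the use of nested lists for A and B are assumptions made for the example.

```python
def s2_output_stream(A, B, row):
    """Yield the operands S2 would output to form one row of C.

    Hypothetical helper: for each k, one element A[row][k] is output
    (to be broadcast), followed by the n elements of row k of B.
    """
    n = len(A)
    for k in range(n):
        yield A[row][k]          # partition 0: one element of A
        for j in range(n):
            yield B[k][j]        # partition 1: n elements of B


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
first_row_stream = list(s2_output_stream(A, B, 0))
print(first_row_stream)          # order: a11, b11, b12, a12, b21, b22
```

For an n x n problem each row of C consumes n x (n + 1) operands from this stream, which matches the operand count given in the text.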

4.3  Programming  the  link  component 

The operation of the link component, LINK, consists of
broadcasting one element of matrix A and distributing
a row of matrix B among four join components, for example
J059, J057, J060, and J062. Thus, the number of
operands to be output by LINK is N(1 + n). The following
instructions program the LINK component for the
matrix multiplication:

lmod LINK, mode                           ;set mode
lsip LINK, F077                           ;set input pattern
lsbp LINK, J059, J057, J060, J062         ;set broadcast pattern
lsop LINK, #n, b, J059, J057, J060, J062  ;output pattern
ltko LINK, NumOpsOutL                     ;set number of output operands

The third instruction sets the broadcasting pattern, which
specifies the components to receive the broadcast operand.
The fourth instruction programs the LINK component to
perform a broadcast, indicated by the symbol b, and then
to distribute n operands to the components listed in the
instruction. The # character always precedes the value of
n when using a 1 then n pattern or an n then 1 pattern.
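
A minimal sketch of the broadcast-then-distribute behavior is given below. It assumes that the n distributed operands are dealt to the listed components in round-robin order, which the text does not state explicitly; the function name `link_route` and the dictionary of per-component queues are hypothetical.

```python
def link_route(stream, n, targets):
    """Route a 1 then n operand stream: broadcast one, then distribute n.

    Assumption: the n distributed operands are dealt to the targets in
    round-robin order; the source text does not spell out the order.
    """
    received = {t: [] for t in targets}
    it = iter(stream)
    for operand in it:
        for t in targets:                      # broadcast phase ("b")
            received[t].append(operand)
        for idx in range(n):                   # distribute phase ("#n")
            try:
                value = next(it)
            except StopIteration:
                return received
            received[targets[idx % len(targets)]].append(value)
    return received


joins = ["J059", "J057", "J060", "J062"]
# One element of A (10) followed by a four-element row of B.
routed = link_route([10, 1, 2, 3, 4], n=4, targets=joins)
print(routed["J059"], routed["J062"])
```

Under this assumption every join component receives the broadcast element of A plus its share of the row of B, so each multiplier has both of its operands.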

4.4 Programming the Join, Fork, and
Processing components

The Fork, Join, and processing components, T and E, can
be programmed in a similar fashion. Since each multiplier
and adder is composed of four pipeline stages, one T component
followed by three E components is needed to configure
one multiplier or one adder. The first pipeline stage
for the processing component must use the T component
because the T component can take two input operands
while the E component can only take one input operand
from the previous pipeline stage.

5  Simulation  and  performance  study 

The sustained speed for executing the matrix multiplication
on the MCAP architecture shown in Fig. 1 is studied
through simulation. The execution time for each
component is determined from an earlier paper [5] where
an MCM implementation of the MCAP architecture using
CMOS technology was studied and analyzed. From
this paper, we found that the total number of transistors
needed for implementing the MCAP architecture shown in
Fig. 1 is around 9.85 million and the whole
system can be built into a 5 cm by 5 cm MCM package using
CMOS technology. Based on this study, the estimated
execution time for each component is listed below [3]:

• Multiplier and adder (64-bit floating-point processor):
40 ns per pipeline stage

• Link components: 8 to 10 ns

• Join and fork components: 4 to 6 ns

• S component (memory controller): 4 to 8 ns for execution
time, address generation time, and bus access
time

• Memory module: 80 ns (two-port DRAM)

Several simulations were run to study the best obtainable
sustained speed for matrix multiplication on the MCAP
architecture. The sustained rate compared to the peak
speed vs. matrix size is plotted in Fig. 5 and the actual
MFLOPS vs. matrix size is plotted in Fig. 6. The total
clock cycles and the average percentage of BUSY, WAIT,
IDLE, and FREE states for the multipliers and adders are
given in Table 1. Many interesting results can be derived
from the simulation data. Note that the peak speed of the
MCAP architecture shown in Fig. 1 is 200 MFLOPS
with four multipliers and four adders using the above execution
time derived from [5].
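
The 200 MFLOPS figure is consistent with the 40 ns stage time if one assumes that each fully pipelined multiplier or adder delivers one result per stage time. That assumption, and the helper name `peak_mflops`, are ours for illustration and are not taken from [5].

```python
def peak_mflops(stage_time_ns, units):
    """Peak rate assuming one result per unit per pipeline stage time."""
    results_per_second_per_unit = 1e9 / stage_time_ns
    return units * results_per_second_per_unit / 1e6


# Four multipliers plus four adders, 40 ns per pipeline stage.
print(peak_mflops(40, 4 + 4))
```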

5.1 Sustained rate in the MCAP architecture

In order to achieve a very high sustained rate in an attached
processor, the bottleneck component needs to be
identified, so that the system performance can be improved.
From the simulation study, the bottleneck component
in the MCAP architecture moves from one component
to another when the address generation time
for the S component is reduced.

For example, the lower two curves in Fig. 5 show that the
sustained rate remains unchanged when the address generation
time for S004 is reduced from 6 ns to 4 ns while
the execution time for all other components is unchanged.
If the address generation time of S004 is set at ' ns while
reducing the execution time of other components, such as
the Join, Fork, or Link components, no improvement in
the sustained rate was found. However, if the address generation
time of S004 is set at 4 ns and the execution time
of J006 is reduced from 6 ns to 5 ns, the highest sustained
rate is increased from 83.3% to 90.9%, as shown in Fig. 5.
Further reducing the execution time of J006 to 4 ns increases
the sustained rate to 94.6% (for matrices with a
size of 128 x 128). Also, if the address generation time of
S004 is fixed between 4 ns and 6 ns, no improvement in the
sustained rate was found by reducing the memory access
time.

Based on the simulation data collected for matrices with
sizes ranging from 4 x 4 to 128 x 128, we are able to predict
the sustained rate for larger matrices. The simulation
results and estimated data are given in Table 1. From this
table, the sustained rate is estimated to be as high as 96%
for matrices with a size of 1024 x 1024 or larger.


Fig. 5. Sustained rate (%) vs. matrix size for matrix
multiplication.


5.2  Performance  comparison  with  other 
high-performance  parallel  computers 

In a recent issue of IEEE Parallel and Distributed
Technology, a thorough performance comparison of
various high-performance parallel computers was made using
the LinPack benchmark [3]. Since matrix multiplication
is the main portion of the computation in the LinPack
benchmark, it is appropriate to compare the simulation
results obtained in this study with the results reported in [3].
According to [3], the Cray X-MP/1, with a peak speed of
235 MFLOPS, achieves the highest sustained rate at 51%
(121 MFLOPS actual speed), while the Cray C-90/16 delivers
the highest actual speed at 479 MFLOPS but with
only 3.1% of its peak speed of 15,238 MFLOPS.

In the MCAP architecture, the peak speed for four multipliers
and four adders is 200 MFLOPS using CMOS
technology. However, by matching the algorithm to the architecture,
the best sustained rate can be as high as 94.6%
for matrices with a size of 128 x 128. The actual speed obtainable
from the MCAP architecture is 190 MFLOPS,
which is within the same range as the Cray Y-MP/1 (145
MFLOPS) or X-MP/4 (178 MFLOPS). Note that both
Cray computers are designed using ECL logic and the
cost for both machines is several million dollars, while the
MCAP architecture is based on the much cheaper CMOS
technology.
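
The percentages quoted in this comparison follow from the ratio of actual to peak speed. The short check below uses only the figures given in the text; the function name `sustained_rate` is an assumption made for the example.

```python
def sustained_rate(actual_mflops, peak_mflops):
    """Sustained rate as a fraction of the peak speed."""
    return actual_mflops / peak_mflops


cray_xmp1 = round(100 * sustained_rate(121, 235))        # percent
cray_c90 = round(100 * sustained_rate(479, 15238), 1)    # percent
mcap_actual = 0.946 * 200                                 # MFLOPS at 94.6%
print(cray_xmp1, cray_c90, mcap_actual)
```

The first two values reproduce the 51% and 3.1% reported for the Cray machines, and the third gives about 189 MFLOPS, which the text rounds to 190 MFLOPS.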

Compared with microcomputers or workstations, the MCAP
has the following advantages. First, it is very easy to construct
an MCAP architecture to match an algorithm so
that a high sustained rate can be obtained. Second, the S
component can be programmed ahead of time for a new
algorithm before completing the current algorithm, so that
the instruction fetching time can be overlapped with the
execution of the current algorithm. Third, the MCAP architecture
can be scaled up to include 10 to 20 processing
components to achieve a peak performance between 200 and
500 MFLOPS using CMOS technology. Lastly, the MCAP
can be constructed from a few basic components and its
architecture is much simpler than that of any modern microprocessor
or workstation.

6  Conclusions 

The architecture to implement a class of high-performance
attached processors, which can be modularly configured
to match given sets of algorithms, has been presented.
The high utilization rate of the processing components is
achieved mainly by (1) minimizing the movement of intermediate
results; (2) prefetching almost all operands using
intelligent memory controllers; and (3) reconfiguring
(through programming) the interconnection of the processing
components to match the needs of a given algorithm.
Moreover, using the small set of fundamental components
defined for the MCAP architecture, it is possible to quickly
prototype an MCAP architecture tailored for a group of
specific applications.

A set of simulation tools has been developed to evaluate
the performance of the MCAP architecture. Through simulation
studies, we found that it is possible to achieve a
very high sustained rate by matching the algorithm to the
MCAP architecture. Thus, although the peak speed of the
MCAP architecture may not be very high, the actual sustained
speed obtainable on the MCAP architecture is in
the same range as that of many supercomputers, such as the Cray
X-MP/1 or Y-MP/4. Furthermore, the power consumption
of the MCAP architecture is estimated at 25 W/cm^2,
which is suitable for air cooling [5]; this is in sharp contrast
to most supercomputers, where liquid cooling is required.


Fig. 6. Sustained speed (MFLOPS) vs. matrix size (matrix dimension n)
for matrix multiplication.
Abstract

The computational intensity of the task being executed is an
important factor in determining the sustainable throughput,
especially for modern computers with hierarchical
memories and highly pipelined processors. This paper determines
the computational intensity with respect to the inner
memory capacity for several computationally intensive
algorithms that have wide application. It also analyzes the
influences of computational intensity on the speed and cost
of hierarchical memories. Based on the analysis, a method
to optimize the memory cost relative to the memory size
and speed at each memory level is also presented.

1  Introduction 

The performance of a computer system is typically
limited by the speed of its memory, and cache memory
has been widely used in reducing the memory access
time [1], [2], [3]. In the two-level memory hierarchy
shown in Fig. 1, where memory accesses must be made
through the first-level memory, M is the number of locations
in outer memory, m is the number of locations
in inner memory, and $n_p$ is the number of floating-point
operations involved in the algorithm.

The average access time is determined by $h$, $t_m$, and $t_M$, where
$h$ is the hit ratio, and $t_m$ and $t_M$ are the access times
of the inner and outer memory, respectively. The inner
memory of this two-level hierarchy can be implemented
in a variety of ways and may represent a memory
subhierarchy. Obviously the hit ratio increases as
the size of the first-level memory grows. However, in
addition to size, the hit ratio is also a function of the
algorithms and applications involved. An algorithm
that exhibits a high degree of locality in memory accesses
will result in a high hit ratio. For computation-intensive
algorithms, this programming locality can
be measured as the computational intensity, which is
defined as the average number of floating-point operations
per access to the second-level memory.

Another reason to investigate computational intensity
is that it directly affects the average performance of a
pipelined processor. For a highly pipelined machine,
main memory becomes the bottleneck of the system.
The speed imbalance between the memory and processing
elements causes the average performance to
be substantially lower than the peak performance designed
into the processor. Sustainable computation
rates of typical supercomputers have been evaluated
and reported by Tang and Davidson [4] and found to
be much lower than the peak rates for most algorithms.
A high computational intensity reduces the
rate of memory access, thus improving the average
performance. Hockney and Jesshope have shown
the average performance as a function of the computational
intensity for nonoverlapped as well as overlapped
memory transfer and arithmetic operation.


Figure  1.  Two-level  memory  hierarchy  structure. 


In this paper, we investigate the computational intensities
for some of the frequently used algorithms
and their impact on the design of hierarchical memory
systems. Section 2 analyzes several computationally
intensive algorithms, including matrix multiplication,
matrix inversion, and solutions to linear and partial
differential equations. For these algorithms, analytical
methods are developed to evaluate their computational
intensities with the inner memory capacity
and problem size as the parameters. Sections 3 and
4 consider possible applications of computational intensities
in the design of hierarchical memory systems.
Expressions for optimizing a hierarchy with respect to
speed and cost are given in terms of computational
intensities.

2 Case Studies of Computational Intensities

In this section the computational intensities as functions
of internal memory size and problem size are determined
for several example algorithms. When m = 0, only the
pipeline buffer registers are available for internal storage.
In some cases, chaining of processing components is assumed.
In deriving the required number of accesses to the
outer memory level it is assumed that the inner memory
is entirely usable (e.g., it is fully associative).

2.1  Matrix  Multiplication 

As discussed in [5], there are three major algorithms for
matrix multiplication: the inner product, middle product,
and outer product algorithms. All three algorithms require
the same number of operations. However, the sequences of
computation are different for the three approaches.

The analysis assumes that each matrix is an $n \times n$ matrix
and the multiplication is

$$C = A \times B.$$

Hence, $c_{ij}$ is computed as follows:

$$c_{ij} = \sum_{k=1}^{n} a_{ik} \times b_{kj}$$

and the total number of computations is $n \times n \times (2n - 1) =
2n^3 - n^2$, which is the same for all three algorithms.

The inner product method computes the elements of C in
sequence. This is implemented in a high-level language as

for i = 1 to n do
  for j = 1 to n do
    for k = 1 to n do
      C[i,j] := C[i,j] + A[i,k] * B[k,j];


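The sequence of computations is $c_{11} = a_{11} \times b_{11}$, $c_{11} =
c_{11} + a_{12} \times b_{21}$, $c_{11} = c_{11} + a_{13} \times b_{31}$, $\ldots$, $c_{11} = c_{11} +
a_{1n} \times b_{n1}$, $c_{12} = a_{11} \times b_{12}$, $c_{12} = c_{12} + a_{12} \times b_{22}$, $\ldots$,
$c_{12} = c_{12} + a_{1n} \times b_{n2}$, $\ldots$.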

Since this method computes one element $c_{ij}$, i.e., one inner
product, before computing the next element, it is advantageous
to keep $c_{ij}$ in the inner memory to reduce accesses
to the outer memory. The above computation sequence
also shows that in computing each row of matrix
C, all elements in matrix B are referenced once. On the
other hand, in the same computation, only one row of matrix
A is referenced and it is used n times. Therefore,
when the available storage m is between 1 and $n + 1$, the
best possible allocation is to store the matrix C element
currently being computed and $m - 1$ matrix A elements.
Based on this, the computation of each row in matrix C
requires $n^2$ fetches of b's and $n$ stores of c's. For a row of c's, there
are $m - 1$ fetches for the a's kept in the inner memory
plus $(n - m + 1)n$ fetches for the remaining $n - m + 1$
a's from the outer memory. The total number of memory
accesses is

$$[n^2 + n + m - 1 + (n - m + 1)n]\,n = 2n^3 + 2n^2 - mn^2 + mn - n.$$

Thus, the computational intensity is

$$\frac{2n^3 - n^2}{2n^3 + 2n^2 - mn^2 + mn - n}.$$

Once the storage exceeds $n + 1$, there is no advantage in
storing more than one row of a's. This is because after
one row of c's is computed, the same row of a's will not
be used again. Therefore, when the inner storage size is
between $n + 1$ and $(n + 1)n + 1$, the inner memory should
keep one c, one row of a's, and as many columns of b's as
the remaining space allows. These columns of b's will be
used in computing each row of c's. For the case $m = kn + n + 1$,
k columns of b's can be kept in the inner memory. The
total number of accesses is

$$n^2 + n^2 + kn + (n - k)n^2 = n^3 + 2n^2 - kn^2 + kn.$$

This includes $n^2$ fetches of a's, $n^2$
stores of c's, $kn$ fetches of the b's that are kept in the inner
memory, plus $(n - k)n^2$ fetches for the remaining b's. Table
1 summarizes computational intensities for various storage
sizes.
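
The two access counts above can be evaluated numerically. The following sketch simply codes the expressions derived in the text; the function name `inner_product_intensity` and the restriction of the second regime to sizes of the form kn + n + 1 are choices made for this example.

```python
def inner_product_intensity(n, m):
    """Computational intensity of the inner product method.

    Covers 1 <= m <= n + 1 and m = k*n + n + 1 with 0 <= k <= n,
    using the access counts derived in the text.
    """
    flops = 2 * n**3 - n**2
    if 1 <= m <= n + 1:
        accesses = 2 * n**3 + 2 * n**2 - m * n**2 + m * n - n
    else:
        k, remainder = divmod(m - n - 1, n)
        if remainder != 0 or not 0 <= k <= n:
            raise ValueError("m must be <= n + 1 or of the form k*n + n + 1")
        accesses = n**3 + 2 * n**2 - k * n**2 + k * n
    return flops / accesses


n = 1000
for m in (1, n + 1, 10 * n + n + 1, n * n + n + 1):
    print(m, inner_product_intensity(n, m))
```

For n = 1000 the intensity is close to 1 at m = 1, close to 2 at m = n + 1, and reaches about 2n/3 once all of B fits in the inner memory, where only the $3n^2$ unavoidable accesses remain.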
Table 1 Computational intensities for the inner product
method.


The middle product method computes an entire column
of c's simultaneously, thus allowing up to n processors
to compute in parallel. In a high-level language,
this can be implemented as

for j = 1 to n do
  for k = 1 to n do
    for i = 1 to n do
      C[i,j] := C[i,j] + A[i,k] * B[k,j];

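The sequence of computations becomes $c_{1j} = a_{11} \times b_{1j}$,
$c_{2j} = a_{21} \times b_{1j}$, $c_{3j} = a_{31} \times b_{1j}$, $\ldots$, $c_{nj} = a_{n1} \times b_{1j}$,
$c_{1j} = c_{1j} + a_{12} \times b_{2j}$, $c_{2j} = c_{2j} + a_{22} \times b_{2j}$, $c_{3j} =
c_{3j} + a_{32} \times b_{2j}$, $\ldots$.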

Since each $b_{kj}$ is used n times and may then be discarded,
keeping $b_{kj}$ in inner memory will reduce the
number of memory accesses. For $1 < m < n + 1$,
the remaining locations store $m - 1$ c's. Based on
this organization, the total number of memory accesses
required in computing one column of c's is
$n(n + 2) + 2(n - m + 1)(n - 1)$. The first term includes
n writes for storing a column of matrix C. The second
term represents the number of accesses required
to save and retrieve partial sums for the remaining
$n - m + 1$ c's that are not kept in the inner memory.
Therefore, the entire matrix multiplication requires
$n(3n^2 + 2n - 2mn + 2m - 2)$ memory accesses.

For a memory size of $kn + n + 1$, in addition to one b
and one column of c's, k columns of a's can be kept in
memory. This leads to the following total number of
accesses for the entire matrix multiplication:

$$n^2 + n^2 + kn + (n - k)n^2$$

The first term is to fetch b's, the second term to store
c's, the third term to fetch k columns of a's, and the
last term to fetch the remaining a's n times. Computational
intensities for the middle product algorithm
are summarized in Table 2.
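
As a numerical check of the middle product counts, the sketch below codes the two access expressions given above. The function names are hypothetical, and including the endpoints m = 1 and m = n + 1 in the first regime is an assumption made here because the two expressions agree at m = n + 1.

```python
def middle_product_accesses(n, m):
    """Outer-memory accesses for the middle product method (from the text)."""
    if 1 <= m <= n + 1:
        return n * (3 * n**2 + 2 * n - 2 * m * n + 2 * m - 2)
    k, remainder = divmod(m - n - 1, n)
    if remainder != 0 or not 0 <= k <= n:
        raise ValueError("m must be <= n + 1 or of the form k*n + n + 1")
    return n**2 + n**2 + k * n + (n - k) * n**2


def middle_product_intensity(n, m):
    """Floating-point operations per outer-memory access."""
    return (2 * n**3 - n**2) / middle_product_accesses(n, m)


n = 1000
print(middle_product_intensity(n, 1))        # roughly 2/3
print(middle_product_intensity(n, n + 1))    # roughly 2
```

With almost no inner memory the middle product method needs about $3n^3$ accesses, compared with about $2n^3$ for the inner product method, which is why its intensity starts lower.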


Table  2  Computational  intensities  for  the  middle  product 
method. 


The outer product algorithm further increases the degree
of parallelism by computing all $n^2$ c elements at
the same time. This allows up to $n^2$ processors to
compute the matrix multiplication in parallel. In a
high-level language, this algorithm is implemented as

for k = 1 to n do
  for i = 1 to n do
    for j = 1 to n do
      C[i,j] := C[i,j] + A[i,k] * B[k,j];

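The sequence of computations becomes $c_{11} = c_{11} +
a_{1k} \times b_{k1}$, $c_{12} = c_{12} + a_{1k} \times b_{k2}$, $c_{13} = c_{13} + a_{1k} \times b_{k3}$,
$\ldots$, $c_{1n} = c_{1n} + a_{1k} \times b_{kn}$, $c_{21} = c_{21} + a_{2k} \times b_{k1}$,
$c_{22} = c_{22} + a_{2k} \times b_{k2}$, $\ldots$.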

For $1 < m < n + 1$, one a and $m - 1$ b's should be
internally stored. This leads to the total number of
accesses $3n^3 - (m - 1)n^2 + (m - 1)n$. For a memory size
of $kn + n + 1$, the inner memory should store one a, one
row of b's, and k rows of c's. The resulting number of
memory accesses in this case is $2n^3 + n^2 - 2kn^2 + 2kn$.
Table 3 shows computational intensities for various
memory sizes based on the outer product algorithm.
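
The outer product counts can be checked in the same way. The sketch below codes the two expressions given above and compares the intensity at m = n + 1 with that of the inner product method, whose access count at that size ($n^3 + 2n^2$) is taken from the inner product analysis. The function name is hypothetical, and the endpoints of the first regime are included by assumption.

```python
def outer_product_accesses(n, m):
    """Outer-memory accesses for the outer product method (from the text)."""
    if 1 <= m <= n + 1:
        return 3 * n**3 - (m - 1) * n**2 + (m - 1) * n
    k, remainder = divmod(m - n - 1, n)
    if remainder != 0 or not 0 <= k <= n:
        raise ValueError("m must be <= n + 1 or of the form k*n + n + 1")
    return 2 * n**3 + n**2 - 2 * k * n**2 + 2 * k * n


n = 1000
flops = 2 * n**3 - n**2
outer_intensity = flops / outer_product_accesses(n, n + 1)
inner_intensity = flops / (n**3 + 2 * n**2)   # inner product count at m = n + 1
ratio = outer_intensity / inner_intensity
print(outer_intensity, inner_intensity, ratio)
```

For n = 1000 the ratio is about 0.5, in line with the statement that the outer product method has about half the computational intensity of the inner product method for inner memories smaller than $n^2$.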


Table  3  Computational  intensities  for  the  outer  product 
method. 


The computational intensities of these three matrix
multiplication algorithms are compared in Fig. 2. The
comparison is based on the matrix size 1000 x 1000,
i.e., n = 1000. As seen in this figure, the inner product
algorithm yields the highest computational intensity
among the three, followed by the middle product
method, and the outer product algorithm is the
lowest. The reason is that the inner product method
completes the sequence of computing one inner product
before starting the next. This computation sequence
involves at most one row of matrix A and one
column of matrix B at a time. On the other hand, the
outer product method provides the maximum degree
of parallelism among the three methods. Since this
method computes all $n^2$ vector product terms at the
same time, it requires access to all elements in matrices
A and B before any vector product is completed,
thus yielding the lowest computational intensity of
the three approaches (about half of that for the inner
product method) when the storage size is less than $n^2$.
Once the storage size is above $n^2$, the inner product
method has no advantage. When the storage size is
below n, the inner product method is also better than
the middle product method. This is because the middle
product method computes n vector product terms
at a time, thus requiring accesses to n operands at a
Figure 2. Computational intensity vs. inner memory
sizes for matrix multiplication

2.2  Matrix  Inversion 

For inversion of an arbitrary $n \times n$ nonsingular matrix
($n > 2$), the pivot method [6], for which the pivot
is determined by scanning a column for its maximum
magnitude, is assumed. Only one column is scanned
for each pivot and the n pivots are determined in
succession using the columns from left to right. The
approximate size of the outer memory, M, must be
at least $n(n + 1)$. The number of flops corresponding
to each pivot is n comparisons, n divisions
(or one reciprocation and n multiplications), $n(n - 1)$
subtractions, and $n(n - 1)$ multiplications, so that the
total number of flops, $n_p$, is

$$n_p = [2n + 2n(n - 1)]\,n = 2n^3.$$
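
The per-pivot operation counts listed above can be added up directly to confirm the $2n^3$ total. The function name `inversion_flops` is an assumption for this small check.

```python
def inversion_flops(n):
    """Total flops for the pivot method, summed from the per-pivot counts."""
    comparisons = n
    divisions = n
    subtractions = n * (n - 1)
    multiplications = n * (n - 1)
    per_pivot = comparisons + divisions + subtractions + multiplications
    return per_pivot * n          # n pivots in succession


for n in (3, 10, 1000):
    print(n, inversion_flops(n), 2 * n**3)
```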

Computational intensities for several sizes of the inner
memory are given in Table 4. Row interchanges
have been ignored. As an example of computing the
number of memory accesses, consider the third entry
in the second column. For each column, the n column
elements are input one at a time and the maximum
element is determined. Then for each group of $m - 1$
elements in the pivot row, the elements are input and
divided by the pivot element. Then for each of the
other $n - 1$ rows, the element in the pivot column
and $m - 1$ other elements are input. After $m - 1$
new values are computed, they are output. Finally,
the $m - 1$ new elements in the pivot row are output.
Therefore,  the  number  of  accesses  for  each  of  the  n 
columns  is  n  -H  {2(m  -  1)  -1-  (2  m  —  l)(n  -  1)]  ~ 

n®  4-  ^=2  n  Also,  the  computational  intensity 
as  a  function  of  m  for  n  =  1000  is  one  of  the  curves  in 
Fig.  3. 
Table 4 Computational intensities for matrix inversion.

2.3  Partial  differential  Equations 

For examining the solution of partial differential equations,
two-dimensional equations are assumed and the
nearest neighbor approach is used [6], [2]. Suppose
that the area of the solution is represented by an $n \times p$
array surrounded by $2(n + p)$ boundary points that
are known. If the dependent variable is $v$ and $f$ is a
known function, then there are constants $a$, $b$, $c$, $d$, and
$e$ such that

$$v_{ij} = a v_{i-1,j} + b v_{i+1,j} + c v_{i,j-1} + d v_{i,j+1} + e f_{ij}, \quad i = 1, \ldots, p, \; j = 1, \ldots, n.$$

The solution is found by $q$ iterations over the entire array
by incrementing $i$ and $j$ and using the new values
of $v_{ij}$ as they are computed. So that the outer memory
can contain the constants and all values of $v_{ij}$ and
$f_{ij}$, the size of M must be at least $2(np + n + p) + 5$.
First, $np$ multiplications are required to replace all $f_{ij}$'s
with $e f_{ij}$ and then, for each iteration, four multiplications
and four additions are required for each point.
Therefore, $n_p = (8q + 1)np \approx 8npq$.
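
The operation count can be illustrated with a small relaxation routine that counts flops as it runs. This is a sketch of the nearest neighbor update described above, not the authors' program; the function name `relax`, the grid layout with a ring of boundary points, and the sample coefficients are assumptions.

```python
def relax(v, f, coeffs, q):
    """Run q nearest-neighbor sweeps in place and return the flop count.

    v is a (p + 2) x (n + 2) grid whose outer ring holds the known
    boundary points; f is the p x n array of known function values.
    """
    a, b, c, d, e = coeffs
    p, n = len(f), len(f[0])
    flops = 0
    # Replace every f_ij by e * f_ij once: np multiplications.
    ef = [[e * f[i][j] for j in range(n)] for i in range(p)]
    flops += n * p
    for _ in range(q):
        for i in range(1, p + 1):
            for j in range(1, n + 1):
                # Four multiplications and four additions per point;
                # freshly updated neighbors are used as soon as available.
                v[i][j] = (a * v[i - 1][j] + b * v[i + 1][j]
                           + c * v[i][j - 1] + d * v[i][j + 1]
                           + ef[i - 1][j - 1])
                flops += 8
    return flops


p, n, q = 3, 4, 5
grid = [[1.0] * (n + 2) for _ in range(p + 2)]   # boundary fixed at 1.0
for i in range(1, p + 1):
    for j in range(1, n + 1):
        grid[i][j] = 0.0                          # interior starts at 0.0
source = [[0.0] * n for _ in range(p)]
flop_count = relax(grid, source, (0.25, 0.25, 0.25, 0.25, 1.0), q)
print(flop_count, (8 * q + 1) * n * p)
```

The counted flops equal $(8q + 1)np$, matching the expression in the text.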

Table 5 gives the computational intensities for several
values of m and one of the curves in Fig. 3 gives the
computational intensity as a function of m for $n =
p = 1000$ and $q = 20$. For example, the fifth entry
in the second column of Table 5 was found for $k = 0$
assuming the five weighting constants are input and
then $pn$ values of $f_{ij}$ are input and $pn$ values of $e f_{ij}$
are output. Then for each iteration the $e f_{ij}$ products,
$2(n + p)$ boundary points, and old values of the $np$
interior points are input, and the new values of the
$np$ interior points are output. For $k > 0$ there are
k points that need to be input and output only once
instead of q times.

2.4  Linear  equations 

Simultaneous linear equations can be solved efficiently
by the Gaussian elimination method [6]. In the first
phase of the Gaussian elimination, n simultaneous linear
equations with n variables are reduced to

$$
\begin{aligned}
a'_{11}x_1 + a'_{12}x_2 + \cdots + a'_{1n}x_n &= c'_1 \\
a'_{22}x_2 + \cdots + a'_{2n}x_n &= c'_2 \\
&\;\;\vdots \\
a'_{nn}x_n &= c'_n
\end{aligned}
\qquad (1)
$$


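In the second phase, variables $x_1$ to $x_n$ can be obtained
by the following computations:

$$x_k = \frac{c'_k - \sum_{j=k+1}^{n} a'_{kj} x_j}{a'_{kk}}, \quad k = n, n-1, \ldots, 1 \qquad (2)$$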

The total number of computations in phase 1 is

$$3[n(n-1) + (n-1)(n-2) + \cdots + 2] = 3\sum_{i=1}^{n-1} i(i+1) = \frac{1}{2}n(n-1)(2n-1) + \frac{3}{2}n(n-1) = n^3 - n.$$

The number of computations in Eq. (2) is

$$1 + 3 + \cdots + (2n - 1) = \sum_{i=1}^{n} (2i - 1) = n^2 \qquad (3)$$

Therefore, the total number of computations for solving
n simultaneous linear equations is

$$C = n^3 + n^2 - n \qquad (4)$$
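
The count in Eq. (4) can be reproduced by instrumenting a straightforward elimination routine. The sketch below is an illustration under stated assumptions: no pivoting is performed, each element update is charged three operations (one division, one multiplication, one subtraction), and each back-substitution term is charged one multiplication and one subtraction. The function name `solve_counting_flops` is hypothetical.

```python
def solve_counting_flops(a, c):
    """Solve a x = c by Gaussian elimination and count operations.

    Assumes nonzero pivots (no row interchanges). Returns (x, flops).
    """
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, c)]   # augmented matrix
    flops = 0
    # Phase 1: reduce to upper triangular form.
    for k in range(n - 1):
        for i in range(k + 1, n):
            lead = aug[i][k]
            for j in range(k + 1, n + 1):
                # One division, one multiplication, one subtraction.
                aug[i][j] -= (lead / aug[k][k]) * aug[k][j]
                flops += 3
            aug[i][k] = 0.0
    # Phase 2: back substitution.
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = aug[k][n]
        for j in range(k + 1, n):
            s -= aug[k][j] * x[j]
            flops += 2
        x[k] = s / aug[k][k]
        flops += 1
    return x, flops


A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
rhs = [12.0, 14.0, 22.0]
solution, flop_count = solve_counting_flops(A, rhs)
n = len(A)
print(solution, flop_count, n**3 + n**2 - n)
```

For this 3 x 3 example the routine performs 24 operations in phase 1 and 9 in phase 2, giving $n^3 + n^2 - n = 33$.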

The number of accesses to outer memory and corresponding
computational intensities are given in Table 6.
Also, see Fig. 3 for the computational intensity
as a function of m for n = 1000. The number of
memory accesses for reducing n simultaneous linear
equations to Eq. (1) in the case of m = 0 is derived as
follows. Since there is no internal memory in the processing
units, all intermediate results must be stored
back to memory and read in for later computations.
The number of memory accesses is

$$7[n(n-1) + (n-1)(n-2) + \cdots + 2] = 7\sum_{i=1}^{n-1} i(i+1) = \frac{7}{6}n(n-1)(2n-1) + \frac{7}{2}n(n-1) = \frac{7}{3}n(n^2 - 1) \qquad (5)$$

And the number of memory accesses in Eq. (2) is

$$3n + \sum_{i=3}^{n+1} i + \sum_{i=1}^{n-1} i = 3n + (n - 1)(n + 2) = n^2 + 4n - 2 \qquad (6)$$

Summing Eqs. (5) and (6), the total number of memory
accesses is

$$\frac{7}{3}n^3 + n^2 + \frac{5}{3}n - 2 \qquad (7)$$

For the case of m = 1, the intermediate product in
the innermost loop can be stored in the inner memory,
thus eliminating two outer memory accesses, e.g., one
write and one read of the intermediate product. So,
the computational intensity is 7/5 times higher than
in the case when m = 0. In the case of m = 3, one can
store the most often used coefficients in
the inner memory to further reduce the outer memory
accesses to three accesses per loop. When more inner
memory is available, one can store all the coefficients
in the inner memory; thus the minimal number of
outer memory accesses will be equal to $n(n + 1)$. Since
the total number of computations is $O(n^3)$, the computational
intensity approaches n when $m = n(n + 1)$.

Table 5 Computational intensities for partial differential
equation solution.


Table 6 Computational intensities for solving linear
equations using Gaussian elimination.


Figure 3. Computational intensity vs. inner memory
sizes for linear equations, matrix inversion, and
partial differential equations.

3  Time  Analysis 

One use of computational intensity is in the time analysis of a hierarchy for
which, at any level i in the hierarchy, the memory communicates
with only the memories at the i - 1 and i + 1
levels. Consider a hierarchy of r + 1 levels. If

• $m_i$ = number of locations in the ith memory level,
i = 0, 1, ..., r, where the 0th level includes only processing
(i.e., $m_0$ = 0),

• $n_i$ = number of operand accesses from the (i + 1)th
level to the ith level, i = 0, 1, ..., r - 1,

• $s_p$ = processor speed in megaflops per second,

• $s_i$ = memory access speed in megaoperands per second
of level i, including the miss determination
time associated with level i - 1, i = 1, ..., r,

• $I_{i-1}$ = computational intensity at the boundary
between levels i - 1 and i, i = 1, ..., r, as a function
of the size of memory inside that boundary,

then the time required to complete an algorithm, assuming
all memory levels except the outer level operate like
ordinary caches, is

$$T = \frac{n_p}{s_p} + \sum_{i=1}^{r} \frac{n_{i-1}}{s_i}$$

Let $T_n$ be the normalized quantity

$$T_n = \frac{T s_p}{n_p} = 1 + \sum_{i=1}^{r} \frac{R_i}{I_{i-1}} \qquad (8)$$

where $R_i = s_p/s_i$ and $I_{i-1} = n_p/n_{i-1} = I(m_{i-1})$, i = 1, ..., r.

Now suppose that instead of using ordinary cache at level
j, it is assumed that the memory at this level can prefetch
operands from the outer levels and automatically supply
them to the inner levels. Then there can be overlapping of
processing and memory accessing in the inner levels with
the memory accessing in the outer levels. This overlapping
implies that the summation in (8) can be replaced with

j = 1

j = 2, ..., r

In the extreme, for which prefetching is done at all levels
so that overlapping is maximized,

$$T_n = \max\left\{1, \frac{R_1}{I_0}, \ldots, \frac{R_r}{I_{r-1}}\right\}$$

This is minimized when $R_i/I_{i-1} \le 1$, or $s_p \le s_i I_{i-1}$, i = 1, ..., r.
For the slowest speeds, $s_i = s_p/I_{i-1}$, i = 1, ..., r.

Next let us assume no prefetching, but that the (j - 1)th
level can access the (j + 1)th level directly. For input,
operands are stored in the (j - 1)th and jth levels simultaneously
and for output, operands are stored in the jth
and (j + 1)th levels simultaneously. Then

Sp  *  Ji  ^  ^  /i  — 1

In the extreme for which all memories can access the processor
level at the same time as the other memory levels
and there is prefetching and direct accessing at all levels

The minimum of $T_n$ for the slowest memory speeds occurs
when

,  1  =  1,..,,  r-l,  and  Sr  =

Ir-l

4 Cost Minimization

Another use of the computational intensity is in the minimization
of the cost of a processing system that includes
a memory hierarchy of r + 1 levels. The total cost is

$$c = C_p(s_p) + \sum_{i=1}^{r} C_m(s_i, m_i), \qquad m_r = M$$

where $C_p(s_p)$ is the cost of the processing logic as a function
of processing speed and $C_m$ is the cost of memory as a
function of memory access speed in operands/s and size in
operand locations. Now suppose that c is to be minimized
relative to the constraints that a given algorithm must be
performed within a specified time T, $s_p$ > 0, $s_i$ > 0 for all
i and $m_i$ > 0 for all i. The time constraint normalized
by $n_p$, assuming no overlapping of processing and memory
accessing, is

$$\frac{1}{s_p} + \sum_{i=1}^{r} \frac{1}{s_i I_{i-1}} \le T \qquad (9)$$
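The two limiting time expressions of Section 3 can be evaluated directly. The sketch below is illustrative only: it assumes the normalized nonoverlapped time is 1 plus the sum of $R_i/I_{i-1}$ with $R_i = s_p/s_i$, that the fully prefetched case is the maximum of 1 and the individual ratios, and that the function names and the sample speeds and intensities are hypothetical.

```python
from typing import List, Sequence


def normalized_time_no_overlap(s_p: float, speeds: Sequence[float],
                               intensities: Sequence[float]) -> float:
    """T_n = 1 + sum_i R_i / I_{i-1}, with R_i = s_p / s_i (no overlapping).

    speeds[k] is s_{k+1}; intensities[k] is I_k (boundary inside level k+1).
    """
    return 1.0 + sum((s_p / s) / i for s, i in zip(speeds, intensities))


def normalized_time_full_overlap(s_p: float, speeds: Sequence[float],
                                 intensities: Sequence[float]) -> float:
    """T_n = max{1, R_1/I_0, ..., R_r/I_{r-1}} (prefetching at all levels)."""
    return max([1.0] + [(s_p / s) / i for s, i in zip(speeds, intensities)])


def slowest_balanced_speeds(s_p: float, intensities: Sequence[float]) -> List[float]:
    """Slowest memory speeds that keep the fully overlapped T_n at 1: s_i = s_p / I_{i-1}."""
    return [s_p / i for i in intensities]


if __name__ == "__main__":
    s_p = 100.0                  # hypothetical processor speed
    intensities = [2.0, 50.0]    # hypothetical I_0, I_1
    speeds = slowest_balanced_speeds(s_p, intensities)
    print(speeds)
    print(normalized_time_no_overlap(s_p, speeds, intensities))
    print(normalized_time_full_overlap(s_p, speeds, intensities))
```

With the balanced speeds, every ratio $R_i/I_{i-1}$ equals 1, so the fully overlapped time stays at the processing time while the nonoverlapped time grows by one unit per memory level.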


For a realistic system, $C_p$ and $C_m$ are strictly monotonic
increasing positive functions (MIPFs) that approach infinity
as speed approaches infinity. Also, $C_m$ is a strictly
MIPF that approaches infinity as memory size approaches
infinity. Therefore, the minimum c must occur when the
inequality becomes an equality and

$$\frac{1}{s_p} = T - \sum_{i=1}^{r} \frac{1}{s_i I_{i-1}} > 0 \qquad (10)$$

But $s_p$ is a strictly monotonic decreasing positive function
(MDPF) of each $s_i$ and, hence, $C_p(s_p)$ is a strictly
MDPF of each $s_i$. This implies that, for each $s_i$, c is the
sum of a strictly MDPF and a strictly MIPF. Therefore,
c has a minimum relative to the $s_i$'s that occurs at exactly
one point $(s_1', \ldots, s_r')$. Similarly, because I(m) is a
MIPF of m, c is known to have a minimum relative to the
$m_i$'s. Moreover, because I(m) may not be strictly monotonic
increasing, the minimum may occur at several points.
But there is at least one point $(s_1', \ldots, s_r', m_1', \ldots, m_{r-1}')$
at which c is a minimum. The speed $s_p$ can be determined
from Eq. (10), and $m_r = M$ is given.

As an example, let us assume that

$$c = C s_p + \sum_{i=1}^{r} f(T)\, s_i (A m_i + B) \qquad (11)$$

where A, B, and C are constants and

1  -  ^  1  j  >  ,o„,  • 

and minimize the cost of the memory hierarchy relative to
the inner product algorithm for multiplying 1000 x 1000
matrices. The factor f(T) is to compensate for a change
to a very fast technology. Suppose that C = 5 x 10~* dollars
per flops/s, A = 10“" dollars per flops/s per operand,
B = 10~* dollars per flops/s, M = 1.6 x 10^ operands, 1,
200, 500, 1001, 2001, 3001, 6001, 251001, 501001, 751001,
and 10001000 operands are the possible memory sizes, and
10, 15, 20, 25, 30, 50, 100, 200 and 400 M operands/s are
the possible memory speeds. Then the minimum total
cost as a function of actual processing speed (as opposed
to the speed of the processor) is as shown in Fig. 4. This
figure has three curves and each curve corresponds to a number
of levels. A four-level hierarchy was also considered,
but in no case did the fourth level permit a reduction in
the total cost.
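A search of this kind over discrete memory sizes and speeds can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used to produce Fig. 4: the cost has the form of Eq. (11) with the factor f held constant at 1, T is the time allowed per operation, the processor speed is fixed by the nonoverlapped time constraint taken as an equality, and `intensity` is a hypothetical callable standing in for the computational intensity I(m) of the algorithm. All numeric values in the example are made up.

```python
from itertools import combinations, product
from typing import Callable, List, Optional, Sequence, Tuple

Result = Tuple[float, float, List[float], List[float]]


def min_cost_no_overlap(T: float, M: float, r: int,
                        sizes: Sequence[float], speeds: Sequence[float],
                        intensity: Callable[[float], float],
                        A: float, B: float, C: float,
                        f: float = 1.0) -> Optional[Result]:
    """Exhaustive search for the cheapest r-level memory hierarchy.

    Returns (cost, s_p, [m_1..m_r], [s_1..s_r]) or None if nothing is feasible.
    """
    best: Optional[Result] = None
    inner_candidates = sorted(x for x in sizes if 0 < x < M)
    for inner in combinations(inner_candidates, r - 1):
        m = list(inner) + [M]                       # m_1 < ... < m_r = M
        # I_0 .. I_{r-1}: intensity as a function of the memory inside each boundary.
        I = [intensity(0)] + [intensity(x) for x in inner]
        for s in product(speeds, repeat=r):
            # Time left for processing once all memory accesses are paid for.
            slack = T - sum(1.0 / (s_i * I_i) for s_i, I_i in zip(s, I))
            if slack <= 0:
                continue                            # constraint cannot be met
            s_p = 1.0 / slack
            cost = C * s_p + sum(f * s_i * (A * m_i + B) for s_i, m_i in zip(s, m))
            if best is None or cost < best[0]:
                best = (cost, s_p, m, list(s))
    return best


if __name__ == "__main__":
    toy_intensity = lambda size: 1.0 + size / 10.0   # hypothetical I(m)
    for levels in (1, 2, 3):
        print(levels, min_cost_no_overlap(T=1.0, M=100.0, r=levels,
                                          sizes=[10.0, 50.0], speeds=[1.0, 2.0, 4.0],
                                          intensity=toy_intensity,
                                          A=0.01, B=1.0, C=1.0))
```

Running the search for several values of r and comparing the minima mirrors the comparison of hierarchies with different numbers of levels described above.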

The above has assumed nonoverlapping of processing and
memory accessing. If we now assume that the memory
at level j can prefetch operands from the outer levels and
automatically supply them to the inner levels, then there
can be overlapping of processing and the memory accessing
of the inner levels with the memory accessing of the outer
levels. This overlapping implies that the summation in
inequality Eq. (9) can be replaced with the maximum of
two summations and


For j = 2, ..., r - 1, assume that

when the cost c is at its minimum. But, without increasing
T, less memory could be used to construct the inner memories
and the cost could be reduced, thus contradicting
the original assumption. Similarly, if inequality (12) is reversed
the cost could be reduced by reducing the amount of
outer memory without increasing T. Therefore, the minimum
c occurs when inequality (12) is replaced by an equality
and both sides are equal to T. Similar arguments could
be made for the cases j = 1 and j = r. This implies that
the minimum occurs when


^"1  ^ -he;:.* 771^ =!::=, rib  ^=2 . 

Therefore, two of the speed variables, say $s_p$ and $s_r$, can
be expressed in terms of T and eliminated from the minimization
process.

In  the  extreme,  complete  overlapping  for  which  all  mem¬ 
ories  operate  independently  in  supplying  and  storing  the 
needed  operands  may  be  assumed.  In  this  case  the  mini¬ 
mum  occurs  when 


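$$T = \frac{1}{s_p} = \frac{1}{s_1 I_0} = \cdots = \frac{1}{s_r I_{r-1}}$$

and all of the speed variables can be expressed in terms of
T and the computational intensities, and the expression
to be minimized becomes

$$c = C_p\!\left(\frac{1}{T}\right) + \sum_{i=1}^{r} C_m\!\left(\frac{1}{T I_{i-1}}, m_i\right), \qquad m_r = M$$

Because $C_p(1/T)$ is a constant for a given value of T, it
may be eliminated from the minimization process. If it is
assumed that $C_m = D s (A m + B)$, D > 0, then the expression
to be minimized is

$$\sum_{i=1}^{r} \frac{A m_i + B}{I_{i-1}}, \qquad m_r = M$$

and the minimal cost is

$$c = C_p\!\left(\frac{1}{T}\right) + \frac{D}{T} \min \sum_{i=1}^{r} \frac{A m_i + B}{I_{i-1}}$$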
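For the completely overlapped case the speeds drop out, and only the inner memory sizes remain to be chosen. The sketch below assumes the quantity to be minimized is the sum over i = 1..r of $(A m_i + B)/I(m_{i-1})$ with $m_0 = 0$ and $m_r = M$; the function name, the candidate sizes, and the toy intensity function are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Optional, Sequence, Tuple


def min_overlapped_memory_term(M: float, r: int, sizes: Sequence[float],
                               intensity: Callable[[float], float],
                               A: float, B: float) -> Optional[Tuple[float, List[float]]]:
    """Minimize sum_{i=1..r} (A*m_i + B) / I(m_{i-1}) over the inner sizes.

    Returns (minimum value, [m_1, ..., m_r]); None if there are too few candidate sizes.
    """
    best: Optional[Tuple[float, List[float]]] = None
    candidates = sorted(x for x in sizes if 0 < x < M)
    for inner in combinations(candidates, r - 1):
        m = [0.0, *inner, M]                 # m_0 = 0 < m_1 < ... < m_r = M
        value = sum((A * m[i] + B) / intensity(m[i - 1]) for i in range(1, r + 1))
        if best is None or value < best[0]:
            best = (value, m[1:])
    return best


if __name__ == "__main__":
    toy_intensity = lambda size: 1.0 + size   # hypothetical I(m)
    for levels in (1, 2, 3):
        print(levels, min_overlapped_memory_term(100.0, levels, [10.0, 50.0],
                                                 toy_intensity, A=1.0, B=0.0))
```

Under the assumed cost form, the total cost is then $C_p(1/T)$ plus D/T times the returned minimum, so the result of this search is independent of the memory speeds.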


If overlapping and the same cost function with the same
parameters used to generate Fig. 4 are assumed, then the
minimum total cost as a function of the number of levels is
shown in Fig. 5 for four problem sizes. The memory sizes
were limited as before, but the memory speeds were not
limited. Note that from Eq. (11) the total cost increases
with the processing speed, which was assumed to be 100
Mflops/s while generating Fig. 5. Also, note that the total
cost depends very little on problem size and there is no
gain in using four memory levels.

5 Summary and Conclusions

In this paper, the computational intensities of various algorithms
are investigated. Computational intensity directly
affects the hit ratios in a hierarchical memory system and
is, therefore, a major factor in memory performance.

This  paper  also  develops  expressions  showing  the  relation¬ 
ship  between  the  computational  intensity  and  the  speed 
and  cost  of  hierarchical  memory  systems.  These  expres¬ 
sions  can  be  used  as  design  tools  in  determining  the  opti¬ 
mal  size  and  speed  for  each  memory  level  without  relying 
on  time-consuming  simulations. 

A cost minimization example is presented which analyzes
a hierarchy for both the overlapped and nonoverlapped
cases. Although the data resulting from only one algorithm
and one set of cost parameters are shown, a variety of
algorithms and cost parameters were examined and it was
noted that in none of the trial minimizations did four memory
levels show a significant cost improvement over three memory
levels. However, for the nonoverlapped trials the memory
speeds were limited to 10 M operands/s and above.

Figure 4. Total cost vs. processing speed for various
numbers of memory levels.

Figure 5. Total cost vs. memory levels with overlapping.

Abstract

Low yield is one of the practical difficulties in the design of WSI systems, such as array
processors or WSI memories. The conventional row-column organization of memory cells is not
suitable for WSI memory systems because of the long signal delay on a wafer and the much more
complicated procedure for replacing a defective row or column of memory cells. To alleviate these
difficulties, a module-sliced WSI memory system is proposed for high-yield WSI memory systems.
The basic unit of the WSI memory system is a module, which consists of a memory bank,
a module comparator, a module register, and a row-column decoder. The WSI memory system
is organized in a two-level row/column structure. The first level is a two-dimensional mesh
whose basic unit is a module. Within each module, i.e., the second level, the memory bank
is organized in conventional rows and columns of memory cells.

The most important feature of the proposed WSI memory system is that the reconfiguration
of a faulty memory system into a fault-free memory system is straightforward: it uses
the module address in each module and requires no reconfiguration algorithm.
Each module in the WSI memory system is a complete memory system and its operation is
independent of any other module. An effective module address stored in the module register
activates a module if it is fault-free; otherwise a dummy address is stored in the module
register to bypass the faulty module without the need for reconfiguration. Since the module
comparator, the module register, and the row-column decoder inside a module are extra hardware
required by the module-sliced WSI memory system, and this hardware in turn decreases the yield of the
memory module, we found through simulation the optimal module size that maximizes the
yield of the WSI memory system. Our studies showed that for 64 Mb and 256 Mb memory
systems, the optimal module size is 16 Kb, while the optimal module size is 64 Kb for a 1024
Mb memory system. The yield rate of the WSI memory system can be as high as 70% for the
64 Mb memory system and 40% for the 1024 Mb memory system using a 0.6 μm technology
with a defect density of four defects per square centimeter.

Index Terms: WSI memory system, module-sliced, yield rate, reconfiguration, defect.

1 Introduction


Due to the advances in VLSI technologies and studies of reconfigurable fault-tolerant
architectures in the past decades, many high-yield VLSI systems built on an entire wafer have
been reported recently [1, 2, 3, 4]. In particular, a 3-D WSI signal processing system using indium
bumps for wafer-to-wafer interconnections has been reported to be able to stack multiple wafers
in a high-density VLSI system [3, 5]. In such a system, a WSI memory system is needed for
storing the image data or supporting the processing elements on another wafer through the
vertical wafer-to-wafer interconnections.

The conventional row-column organization of memory cells is not suitable for WSI memory
systems because of the long signal delay on a wafer and the much more complicated procedure for
replacing a defective row or column of memory cells. Moreover, since the yield of a module decreases
as its size increases, the yield of a system can be improved by decomposing a large
system into several smaller submodules, i.e., by using the module-sliced approach. The submodules
collectively perform the function of the original large module. In an earlier paper [6], we
proposed a reconfigurable fault-tolerant segmented array processor (RFTSAP) structure
to realize the module-sliced approach for high-yield WSI array processors. In this paper, our
fault-tolerant WSI memory system can be reconfigured without employing any reconfiguration
algorithm by using a module address in each module.

In this paper, a module-sliced memory architecture is proposed for high-yield WSI memory
systems. The basic unit of the WSI memory system is a module, which consists of a memory
bank, a module comparator (MC), a module register (MR), and a row-column decoder. The
WSI memory system is organized in a two-level row/column structure. The first level is a
two-dimensional mesh whose basic unit is a module. Within each module, i.e., the second level,
the memory bank is organized in conventional rows and columns of memory cells. The size of
the memory bank is a power of 4, such as 16 Kb or 64 Kb, organized in rows and columns.
The actual number of rows and columns depends on the size of the memory bank.

Each module in the WSI memory system is a complete memory system, and its operation is
independent of any other module. Addressing a memory cell in the module-sliced WSI memory
system is done in two steps, which can be processed concurrently. A module address is sent to
the MC to select a module, and a cell address is sent to every module to select one memory
cell within each memory module. If the memory bank in a module has one or more defective
cells, a dummy module address, which permanently disables the faulty memory module,
is stored in the MR. On the other hand, if a memory module is tested as fault-free, an
effective module address is loaded into the MR; this activates the memory module when
the MC detects a match between the stored module address and the address on the address
bus. However, if the MC or MR, or an address or control signal line in the memory module,
is faulty, an entire column of memory modules is discarded, since the defective memory
module may affect the read/write operations of other fault-free memory modules in the same
column; this is considered a serious defect in the module-sliced WSI memory system.

Since a memory module is disabled if there is one or more defective cells in the
module, a smaller memory bank in a module will generally minimize the percentage of wasted
memory cells. However, the MC, MR, and the row-column decoder inside a module are
extra hardware required by the module-sliced WSI memory system, and this hardware in turn decreases
the yield of the memory module. Moreover, the probability of having a serious defect in a
memory module depends on the area ratio of the extra hardware to the memory bank.
A smaller memory module has a higher probability of having a serious defect, which
destroys an entire column of memory modules. Thus, it is desirable to derive the optimal
module size that maximizes the yield of the WSI memory system.

Our analysis showed that for 64 Mb and 256 Mb memory systems, the optimal module size
is 16 Kb, while the optimal module size is 64 Kb for a 1024 Mb memory system. The yield rate
of the WSI memory system can be as high as 70% for the 64 Mb memory system and 40% for
the 1024 Mb memory system using a 0.6 μm CMOS technology with a defect density of four
defects per square centimeter.

The rest of this paper is organized as follows. The architecture, addressing technique,
and the procedure to bypass the faulty modules in the module-sliced WSI memory system are
described in Section 2. The analysis and derivation of the optimal module size based on a
6-transistor SRAM layout are discussed in Section 3. This paper concludes with Section 4.

2 Architecture and operation of the module-sliced memory
system

The basic unit of the module-sliced WSI memory system is a module, which consists of a
memory bank, an MC, an MR, and a row-column decoder. Assume the total memory capacity
of a memory system is $2^N$. The WSI memory system is organized in a two-level row/column
structure. The first level is a two-dimensional mesh with a total of $2^m$ modules organized
in a $2^{m/2} \times 2^{m/2}$ square mesh (assume m is an even number). A module decoding circuit in
the first level is used to select one module out of the $2^m$ modules to load an effective module
address to the MR in each module. In each module, the size of the memory bank is $2^t$, where
m + t = N. The memory cells in the memory bank are organized in a conventional row-column
fashion, i.e., $2^{t/2} \times 2^{t/2}$ (assume t is also an even number). The MR stores an m-bit module
address and has two modes: configuration mode and operation mode. In the
configuration mode, the most significant m bits on the address bus are stored in the MR.
In the operation mode, the content of the MR is not changed. The MC compares, in parallel,
the m-bit address held in the MR (in operation mode) with the most significant m bits on the
address bus.

The row-column decoder decodes the least significant t bits on the address bus and selects
one cell from the memory bank. The block diagram of a memory module is shown in Fig. 1
and the organization of 4 x 4 modules is shown in Fig. 2.

2.1 Addressing procedure in the module-sliced WSI memory systems

To address one memory cell in the entire memory system, a two-step addressing process
is carried out. The most significant m address bits are sent to the MC in each
module. The MC compares the m-bit address stored in the MR with these address lines. If
there is a match, the memory module is activated; otherwise the entire module is
disabled. At the same time, the row-column decoder in each module receives the least
significant t address bits and selects a unique cell from the memory bank. If every
fault-free memory module has a unique module address, only one module is activated
by the most significant m address bits, and thus only one memory cell is accessed at a time.
One advantage of the module-sliced memory system is that the MC and the row-column
decoder in each module can operate at the same time, thus reducing the address decoding time
significantly.
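The two-step address split can be modeled in a few lines. This is a behavioral sketch only, not the hardware described in the paper: the function names are illustrative, a dummy module address is represented by `None` (a value no effective address can match), and mapping the upper t/2 cell-address bits to the row and the lower t/2 bits to the column is an assumption.

```python
from typing import List, Optional, Tuple


def split_address(address: int, m: int, t: int) -> Tuple[int, int]:
    """Split an N = m + t bit address into (module address, cell address)."""
    if not 0 <= address < 1 << (m + t):
        raise ValueError("address out of range")
    return address >> t, address & ((1 << t) - 1)


def cell_row_col(cell: int, t: int) -> Tuple[int, int]:
    """Row-column decode inside a 2^(t/2) x 2^(t/2) bank (t even)."""
    half = t // 2
    return cell >> half, cell & ((1 << half) - 1)


def select(registers: List[Optional[int]], address: int,
           m: int, t: int) -> Tuple[List[int], Tuple[int, int]]:
    """Model the MC comparison in every module plus the row-column decode.

    registers[k] is the MR content of module k; None models a dummy address.
    Returns the indices of activated modules and the selected (row, column).
    """
    module_address, cell = split_address(address, m, t)
    active = [k for k, mr in enumerate(registers) if mr == module_address]
    return active, cell_row_col(cell, t)


if __name__ == "__main__":
    # Four modules; module 1 is bypassed with a dummy address.
    registers = [3, None, 11, 7]
    print(select(registers, 0b10110110, m=4, t=4))
```

If every fault-free module holds a distinct effective address, the list of activated modules has at most one entry, which corresponds to the single-cell access described above.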

2.2  Bypassing  faulty  memory  modules 

Due to the imperfect manufacturing procedure, it is almost impossible to fabricate a WSI
memory system larger than 64 Mb without any defects. If no spare rows or columns
of memory cells are provided in each memory module, the module needs to be bypassed or
permanently disabled even if there is only one defect in the memory bank. Although many
reconfiguration algorithms have been reported to be able to reconfigure WSI systems in the
presence of defects [7, 8, 9], the defective modules in the module-sliced WSI memory system
can be bypassed without using any sophisticated reconfiguration algorithm. Bypassing a faulty
module is easily done by storing a dummy module address in the MR so that the module
is never activated when an effective address is presented.

However, if the MR or the MC is defective, this is considered a serious defect, since
it is impossible to disable the module by storing a dummy address in the MR. Moreover, if
the address line or the control signal line in the module is defective, it is also impossible to
bypass the defective module. In such a case, an entire column of modules must be discarded
by disconnecting the power connection to that column. In order to reduce the number of good
memory modules discarded because of a serious defect, a complete set of address bus and
control signal lines is passed to each column of memory modules, so that any serious defect in
one column of memory modules affects only that column, not the adjacent columns. Note
that the probability of having a serious defect in a memory module depends on the ratio of the
areas of the memory bank and the extra hardware, such as the MC, MR, and the signal lines.

The  configuration  of  a  2  x  4  array  out  of  a  4  x  4  array  is  shown  in  Fig.  3. 
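The bypass rule can be illustrated with a small configuration routine. This is a sketch under assumptions that are not taken from the paper: effective addresses are handed out consecutively in row-major order, a dummy address is represented by `None`, and a discarded (powered-off) column is modeled by simply leaving its registers at the dummy value.

```python
from typing import List, Optional, Tuple

Grid = List[List[bool]]


def configure_registers(bank_faulty: Grid,
                        serious: Grid) -> Tuple[List[List[Optional[int]]], int]:
    """Assign effective module addresses to usable modules in a mesh.

    bank_faulty[r][c]: the memory bank of module (r, c) has a defect.
    serious[r][c]:     the MC, MR, or signal lines of module (r, c) are defective.
    Returns (MR contents per module, number of usable modules).
    """
    rows, cols = len(bank_faulty), len(bank_faulty[0])
    # One serious defect discards the whole column of modules.
    dead_cols = {c for c in range(cols) if any(serious[r][c] for r in range(rows))}
    registers: List[List[Optional[int]]] = [[None] * cols for _ in range(rows)]
    next_address = 0
    for r in range(rows):
        for c in range(cols):
            if c in dead_cols or bank_faulty[r][c]:
                continue                     # bypassed: keeps the dummy address
            registers[r][c] = next_address
            next_address += 1
    return registers, next_address


if __name__ == "__main__":
    bank = [[False, True], [False, False]]
    none_serious = [[False, False], [False, False]]
    print(configure_registers(bank, none_serious))
```

No search or routing step is involved: the usable capacity is just the number of modules that receive an effective address.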

2.3  Spare  rows  and  columns  in  the  memory  bank 

When the size of the memory bank is large, for instance, larger than 64 Kb, it is found that
most of the defects fall on the memory bank because the hardware overhead of the MC,
MR, and control lines is almost negligible. Thus, it is desirable to provide some spare rows
and columns in the memory bank, so that the memory module can be repaired by replacing
the defective rows or columns with spares. The improvement of the yield rate obtained by providing
spare rows and columns is discussed in the following section.

3  Yield  rate  analysis 

The yield rate for a VLSI system is usually defined as the ratio between the number of
good chips and the total number of chips fabricated on a wafer. However, because the entire
WSI memory system is built on one wafer, the yield rate definition for the VLSI system
needs to be modified to describe the yield rate of the WSI memory system. Note that each
memory module in the proposed module-sliced memory system is a complete system and every
fault-free memory module can be configured into a functioning memory system with a reduced
usable memory capacity. Since a memory module is either activated if it is fault-free or
discarded if it is faulty, the ratio of the number of fault-free modules to the total number of
modules fabricated on the wafer is equivalent to the yield rate definition for a VLSI system.

Let the number of faulty memory modules be f; the yield rate of the WSI memory system
is then $(2^m - f)/2^m$. A memory module is considered faulty when there is one or more defects in its
memory bank, and an entire column of memory modules is considered faulty if any one of
the memory modules in that column has a serious defect. Assuming the defect density D
is a constant, such as 4 defects per square centimeter, the total number of defects on an entire
wafer is bounded by the product D x A, where A is the area of the wafer. Thus, the smaller
the module size (smaller t), the higher the yield rate, since $2^m = 2^N/2^t$ increases as
t is reduced. However, the area of the MC, MR, and signal lines relative to the memory
bank increases when the size of the memory bank decreases, thus increasing the probability
of having a serious defect.
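As a worked illustration of this definition, consider a 256 Mb system built from 16 Kb modules. The counts of faulty modules used below (1000 bank-faulty modules, one serious defect) are hypothetical values chosen only to show the arithmetic; they are not simulation results from this study.

```latex
\begin{align*}
N &= 28,\quad t = 14,\quad m = N - t = 14,\quad 2^m = 16384 \text{ modules in a } 128 \times 128 \text{ mesh},\\
\text{yield}\,(f = 1000) &= \frac{2^m - f}{2^m} = \frac{16384 - 1000}{16384} \approx 0.939,\\
\text{yield}\,(f = 1000 + 128) &= \frac{16384 - 1128}{16384} \approx 0.931 .
\end{align*}
```

The second line assumes that a single serious defect discards one full column of 128 modules, none of which was already counted among the 1000 bank-faulty modules.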

In order to find the optimal module size, we consider the layout of a 6-transistor SRAM
as an example in this study. A complete VLSI layout of a 2 x 2 memory module with 64 bits
memory bank is shown in Fig. 4. From this figure, it is easy to see that when the memory
bank size is very small, the hardware overhead is almost 40% of the entire memory module,
thus dramatically increasing the probability of having a serious defect. Based on the layout in
Fig. 4, we have calculated the module size for 4 Kb and 64 Kb in a 256 Mb memory system for
1.0 to 0.4 μm design technologies. The calculated areas are given in Tables 1 and 2.

3.1  Simulation  results 

Based on the areas calculated from the basic layout of the 64 bits memory module, we have
randomly generated defects on the wafer assuming D = 2, 3, 4, and 6 for 1.0 μm, 0.8 μm, 0.6
μm, and 0.4 μm design technologies, respectively. The defects are uniformly distributed over
the entire wafer. If a defect falls on a memory bank, the memory module is marked as faulty.
If a serious defect is found, all memory modules in the same column are marked as faulty and
cannot be repaired by the spare rows or columns in the memory bank.
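A Monte Carlo experiment of this kind can be sketched as follows. It is a simplified stand-in for the simulation described here, with several assumptions that are not from the paper: the wafer area is taken to be exactly the area of the module mesh, the number of defects per trial is fixed at D x A, every defect landing on the overhead hardware counts as a serious defect, there are no spare rows or columns, and the area values in the example are made up.

```python
import random


def simulate_yield(mesh: int, bank_area: float, overhead_area: float,
                   defect_density: float, trials: int = 200, seed: int = 0) -> float:
    """Estimate the yield rate of a mesh x mesh module-sliced memory.

    Areas are per module in cm^2; defect_density is in defects per cm^2.
    """
    rng = random.Random(seed)
    n_modules = mesh * mesh
    module_area = bank_area + overhead_area
    n_defects = round(defect_density * module_area * n_modules)
    total = 0.0
    for _ in range(trials):
        bank_faulty = set()   # modules with a defect in the memory bank
        dead_cols = set()     # columns discarded because of a serious defect
        for _ in range(n_defects):
            module = rng.randrange(n_modules)            # uniform over the wafer
            if rng.random() < overhead_area / module_area:
                dead_cols.add(module % mesh)             # defect hit MC/MR/lines
            else:
                bank_faulty.add(module)                  # defect hit the bank
        good = sum(1 for k in range(n_modules)
                   if k not in bank_faulty and k % mesh not in dead_cols)
        total += good / n_modules
    return total / trials


if __name__ == "__main__":
    # Hypothetical areas: compare a small and a large module for the same capacity.
    print(simulate_yield(mesh=64, bank_area=0.004, overhead_area=0.0004, defect_density=4))
    print(simulate_yield(mesh=16, bank_area=0.064, overhead_area=0.0008, defect_density=4))
```

Sweeping the module size with areas taken from an actual layout, and adding a repair rule for spare rows and columns, would be needed before comparing against the figures discussed in this section.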

Figs. 5 and 6 show the yield rate of the module-sliced WSI memory system vs. module
size without any spare rows or columns in the memory bank. It is found that the
optimal module size for the 64 Mb and 256 Mb memory systems is 16 Kb, while the optimal module
size is 64 Kb for the 1024 Mb memory system. With two spare rows or columns in the memory
module (for module size larger than 64 Kb) as shown in Fig. 7, the optimal module size for 64
Mb and 256 Mb memory shifts to 64 Kb and the yield rate can be as high as 98% for the 64
Mb memory system. The optimal module size for the 1024 Mb memory shifts to 256 Kb in
this case. As shown in Fig. 8 with 4 spare rows or columns, the optimal module size remains
the same for all three memory systems. Note that the yield rate for the 1024 Mb system is as
high as 75% with four spare rows or columns in the memory bank using the optimal module
size (256 Kb). Several simulations were run to study the effect of increasing the number of spare rows
and columns on the optimal module size, and it is found that the optimal module size ranges
between 64 Kb and 256 Kb with up to 32 spare rows or columns.

Furthermore,  we  found  that  when  the  module  size  is  less  than  16  Kb,  the  probability  of 
having  a  serious  defect  is  higher  than  10%,  thus,  the  yield  rate  will  not  be  increased  significantly 
even  if  spare  rows  or  columns  are  provided  in  the  memory  bank  since  the  serious  defect  cannot 
be  repaired  by  the  spare  memory  cells. 

4  Conclusions 

In this paper, we have studied a module-sliced WSI memory system which allows defective
memory modules to be easily bypassed, leaving only the fault-free modules in use. Each memory module is
a complete system and its operation is independent of any other module. An optimal module
size which maximizes the yield rate of the WSI memory system is derived from the simulation
study.

Several advantages of the module-sliced WSI memory system are as follows. First, the
reconfiguration of a faulty memory system into a fault-free memory system is straightforward:
it uses the module address in each module and requires no reconfiguration algorithm.
Second, the address decoding is done in two steps in parallel, i.e., comparison in the
MC and decoding in the row-column decoder, thus reducing the memory access time. Third, a
faulty module can be easily bypassed by storing a dummy address in the MR if the defects are
in the memory bank, without imposing any complicated reconfiguration algorithm. Fourth, since
each module is a complete system, it is possible to test the modules in parallel, thus
reducing the testing complexity from $2^N$ to $2^t$, which is several orders of magnitude less than
$2^N$.

Several interesting topics worth further investigation are (1) modification
of the memory bank so that the memory module can be read and written in byte, word,
or double-word widths, (2) addition of self-testing circuitry in each memory module so that
parallel testing can be implemented, and (3) a timing analysis of the WSI memory
system.

Fig. 1. Block diagram of a memory module

Fig. 3. Configuration of 2 x 2 modules out of 4 x 4 modules

Fig. 4. VLSI layout of a 2 x 2 module-sliced WSI memory system

Fig. 5. Yield rate of a WSI memory system (1.0 μm technology, no spares)